I. How the StreamGraph Is Generated: Source Code Analysis
0. Conclusions First
- User-code definition phase: you create a StreamExecutionEnvironment and call operators such as map, filter, and process to define the logical dataflow; each operator is wrapped in a Transformation object and appended to the environment's transformations list.
- StreamGraph generation phase: calling env.getStreamGraph() triggers the generation flow, which is driven by StreamGraphGenerator:
  - iterate over all Transformations, choosing the matching TransformationTranslator for each type
  - the translator maps each logical transformation to StreamNodes (physical nodes) and StreamEdges (dataflow edges)
  - the result is a complete StreamGraph containing the nodes, edges, resource configuration, and everything else execution needs
The StreamGraph generation flow:
- StreamExecutionEnvironment.getStreamGraph: calls StreamGraphGenerator.generate
- -> StreamGraphGenerator.generate: iterates over every operator node, calling transform on each to build up the StreamGraph
- -> transform: looks up the translator matching the current node and calls translate to run it
- -> translate: chooses a translation strategy based on the execution mode -> translator implementation's translateForStreaming
- -> e.g. OneInputTransformationTranslator.translateForStreaming, which calls translateInternal in its parent class AbstractOneInputTransformationTranslator to do the real translation: the upstream and downstream nodes are turned into StreamNodes and connected with a StreamEdge
How transformations get added to List<Transformation<?>> transformations (a minimal end-to-end sketch follows this list):
- map | flatMap | filter | process operators -> transform -> doTransform -> env's addOperator appends the node to transformations
- assignTimestampsAndWatermarks -> env's addOperator appends the node to transformations
- addSink -> env's addOperator appends the node to transformations
- executeAndCollectWithClient -> env's addOperator appends the node to transformations
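Putting the two phases together, here is a minimal sketch (the class and job names are illustrative; the operators are standard DataStream API): every operator call appends a Transformation to the env, and getExecutionPlan() internally calls getStreamGraph(), so we can look at the generated StreamGraph without executing the job.
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class StreamGraphDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements("a", "b", "c")                            // source transformation
           .map((MapFunction<String, String>) String::toUpperCase) // OneInputTransformation via doTransform -> addOperator
           .filter(s -> !s.isEmpty())                              // another OneInputTransformation
           .print();                                               // sink, registered through addSink -> addOperator
        // renders the StreamGraph produced by StreamGraphGenerator as a JSON plan
        System.out.println(env.getExecutionPlan());
    }
}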
1. Starting from StreamExecutionEnvironment
To run a Flink job, we first create an env object, then write our map, filter, process, and other operators, and finally hand everything over to env.execute(). So let's start from the env: StreamExecutionEnvironment has a getStreamGraph method, shown below.
@Internal
public StreamGraph getStreamGraph(String jobName, boolean clearTransformations) {
// delegates to generate() to build the StreamGraph
StreamGraph streamGraph = getStreamGraphGenerator().setJobName(jobName).generate();
if (clearTransformations) {
this.transformations.clear();
}
return streamGraph;
}
// appends each operator to the transformations list; the caller is DataStream, see the DataStream analysis in section II below
public void addOperator(Transformation<?> transformation) {
Preconditions.checkNotNull(transformation, "transformation must not be null.");
this.transformations.add(transformation);
}
2. On to StreamGraphGenerator
Following getStreamGraphGenerator().setJobName(jobName).generate() into generate(), we arrive at the StreamGraphGenerator class:
public StreamGraph generate() {
streamGraph = new StreamGraph(executionConfig, checkpointConfig, savepointRestoreSettings);
shouldExecuteInBatchMode = shouldExecuteInBatchMode(runtimeExecutionMode);
configureStreamGraph(streamGraph);
alreadyTransformed = new HashMap<>();
/* transformations is a list holding the operators from the user code, e.g. map, filter, process */
// iterate over every operator and run transform on it; under the hood a DataStream is just a wrapper around a Transformation
for (Transformation<?> transformation: transformations) {
// transform turns each transformation into StreamNodes and StreamEdges; think of it as building a graph where nodes are linked by edges, which ultimately forms the StreamGraph
transform(transformation);
}
final StreamGraph builtStreamGraph = streamGraph;
alreadyTransformed.clear();
alreadyTransformed = null;
streamGraph = null;
return builtStreamGraph;
}
Next, transform(). Note that although generate() iterates the list in user-call order, transform() recursively translates a node's parents first (via getParentInputIds in translate, below) and memoizes results in alreadyTransformed, so each transformation is translated exactly once:
// translates one transformation into the StreamNodes and StreamEdges of the StreamGraph
// returns the ids produced by this transform, usually a single id (except for FeedbackTransformation)
private Collection<Integer> transform(Transformation<?> transform) {
// if this transformation has already been translated, return the cached result
if (alreadyTransformed.containsKey(transform)) {
return alreadyTransformed.get(transform);
}
LOG.debug("Transforming " + transform);
// fill in the max parallelism if it hasn't been set explicitly on the operator
if (transform.getMaxParallelism() <= 0) {
// if the max parallelism hasn't been set, then first use the job wide max parallelism
// from the ExecutionConfig.
int globalMaxParallelismFromConfig = executionConfig.getMaxParallelism();
if (globalMaxParallelismFromConfig > 0) {
transform.setMaxParallelism(globalMaxParallelismFromConfig);
}
}
// probe the output type to trigger MissingTypeInfo errors: if generics were erased, this throws here
transform.getOutputType();
// pick the translator matching the Transformation's concrete class
/*
private static final Map<Class<? extends Transformation>, TransformationTranslator<?, ? extends Transformation>> translatorMap;
static {
@SuppressWarnings("rawtypes")
Map<Class<? extends Transformation>, TransformationTranslator<?, ? extends Transformation>> tmp = new HashMap<>();
tmp.put(OneInputTransformation.class, new OneInputTransformationTranslator<>());
tmp.put(TwoInputTransformation.class, new TwoInputTransformationTranslator<>());
tmp.put(MultipleInputTransformation.class, new MultiInputTransformationTranslator<>());
tmp.put(KeyedMultipleInputTransformation.class, new MultiInputTransformationTranslator<>());
tmp.put(SourceTransformation.class, new SourceTransformationTranslator<>());
tmp.put(SinkTransformation.class, new SinkTransformationTranslator<>());
tmp.put(LegacySinkTransformation.class, new LegacySinkTransformationTranslator<>());
tmp.put(LegacySourceTransformation.class, new LegacySourceTransformationTranslator<>());
tmp.put(UnionTransformation.class, new UnionTransformationTranslator<>());
tmp.put(PartitionTransformation.class, new PartitionTransformationTranslator<>());
tmp.put(SideOutputTransformation.class, new SideOutputTransformationTranslator<>());
tmp.put(ReduceTransformation.class, new ReduceTransformationTranslator<>());
tmp.put(TimestampsAndWatermarksTransformation.class, new TimestampsAndWatermarksTransformationTranslator<>());
tmp.put(BroadcastStateTransformation.class, new BroadcastStateTransformationTranslator<>());
translatorMap = Collections.unmodifiableMap(tmp);
}
* */
@SuppressWarnings("unchecked")
final TransformationTranslator<?, Transformation<?>> translator =
(TransformationTranslator<?, Transformation<?>>) translatorMap.get(transform.getClass());
Collection<Integer> transformedIds;
// a translator was found: call translate to run it
if (translator != null) {
transformedIds = translate(translator, transform);
} else { // no registered translator: fall back to legacyTransform
transformedIds = legacyTransform(transform);
}
// cache the result and return it
if (!alreadyTransformed.containsKey(transform)) {
alreadyTransformed.put(transform, transformedIds);
}
return transformedIds;
}
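The transform.getOutputType() probe above deserves a concrete example. With a lambda that produces a generic type, erasure leaves the output type as MissingTypeInfo, and the exception surfaces once the output type is needed (at the latest in StreamGraphGenerator.transform) unless the type is supplied explicitly. A hedged illustration:
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class MissingTypeInfoDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> words = env.fromElements("a", "b");
        // Tuple2's generic parameters are erased from the lambda, so the extractor
        // records MissingTypeInfo; .returns(...) fixes the output type so that
        // transform.getOutputType() succeeds during StreamGraph generation.
        words.map(w -> Tuple2.of(w, 1))
             .returns(Types.TUPLE(Types.STRING, Types.INT))
             .print();
        env.execute("missing-type-info");
    }
}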
(1) When a matching translator exists
<1> translate()
private Collection<Integer> translate(
final TransformationTranslator<?, Transformation<?>> translator,
final Transformation<?> transform) {
checkNotNull(translator);
checkNotNull(transform);
// collect the ids of all parent (upstream) transformations; getParentInputIds recursively translates them first if needed
final List<Collection<Integer>> allInputIds = getParentInputIds(transform.getInputs());
// check again whether this transformation has been translated, since translating the parents may already have handled it
if (alreadyTransformed.containsKey(transform)) {
return alreadyTransformed.get(transform);
}
// determine the slot sharing group
/* determineSlotSharingGroup logic:
 * if the transformation has an explicitly configured slot sharing group, use it;
 * otherwise, try to inherit the slot sharing group from the parent nodes
 */
final String slotSharingGroup = determineSlotSharingGroup(
transform.getSlotSharingGroup(),
allInputIds.stream()
.flatMap(Collection::stream)
.collect(Collectors.toList()));
// build the translation context
final TransformationTranslator.Context context = new ContextImpl(
this, streamGraph, slotSharingGroup, configuration);
// choose the translation strategy based on the execution mode
return shouldExecuteInBatchMode
? translator.translateForBatch(transform, context)
: translator.translateForStreaming(transform, context);
}
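As a reading aid, here is a hedged sketch of the determineSlotSharingGroup decision described in the comment above. It is simplified: it takes the parents' group names directly, whereas the real method resolves them from node ids via the StreamGraph.
import java.util.List;
public final class SlotSharingSketch {
    // simplified stand-in for StreamGraphGenerator#determineSlotSharingGroup
    static String determineSlotSharingGroup(String specified, List<String> parentGroups) {
        if (specified != null) {
            return specified;                  // an explicitly configured group always wins
        }
        String inherited = null;
        for (String group : parentGroups) {
            if (inherited == null) {
                inherited = group;             // first parent: tentatively inherit its group
            } else if (!inherited.equals(group)) {
                return "default";              // parents disagree: fall back to the default group
            }
        }
        return inherited != null ? inherited : "default"; // no parents: default group
    }
    public static void main(String[] args) {
        System.out.println(determineSlotSharingGroup(null, List.of("default", "default"))); // default
        System.out.println(determineSlotSharingGroup("ssg-1", List.of("default")));         // ssg-1
    }
}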
<2> Then comes translateForStreaming()
TransformationTranslator is an interface; its implementations extend the abstract base class SimpleTransformationTranslator.
The base class's translateForStreaming calls the subclass's translateForStreamingInternal. This is the template method pattern: the base class fixes the skeleton (translate, then configure) and defers the mode-specific translation to its subclasses.
@Override
public Collection<Integer> translateForStreaming(final T transformation, final Context context) {
checkNotNull(transformation);
checkNotNull(context);
// dispatches to the subclass: OneInput translators handle transformation operators like map, Partition translators handle repartitioning operators like keyBy
final Collection<Integer> transformedIds =
translateForStreamingInternal(transformation, context);
configure(transformation, context);
return transformedIds;
}
- Nodes without keyBy are handled by OneInputTransformationTranslator
- keyBy (partition) nodes are handled by PartitionTransformationTranslator
@Internal
public final class OneInputTransformationTranslator<IN, OUT>
extends AbstractOneInputTransformationTranslator<IN, OUT, OneInputTransformation<IN, OUT>> {
@Override
public Collection<Integer> translateForBatchInternal(
final OneInputTransformation<IN, OUT> transformation,
final Context context) {
KeySelector<IN, ?> keySelector = transformation.getStateKeySelector();
Collection<Integer> ids = translateInternal(transformation,
transformation.getOperatorFactory(),
transformation.getInputType(),
keySelector,
transformation.getStateKeyType(),
context
);
boolean isKeyed = keySelector != null;
if (isKeyed) {
BatchExecutionUtils.applySortingInputs(transformation.getId(), context);
}
return ids;
}
@Override
public Collection<Integer> translateForStreamingInternal(
final OneInputTransformation<IN, OUT> transformation,
final Context context) {
// the actual translation is done by the parent class AbstractOneInputTransformationTranslator's translateInternal
return translateInternal(transformation,
transformation.getOperatorFactory(),
transformation.getInputType(),
transformation.getStateKeySelector(),
transformation.getStateKeyType(),
context
);
}
}
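For contrast with the OneInput case, here is a short, hedged fragment (assuming an env already exists) showing what goes down the partition path: keyBy adds no operator of its own; it wraps its input in a PartitionTransformation that PartitionTransformationTranslator later turns into a virtual partition node, while the map that follows becomes a keyed OneInputTransformation whose stateKeySelector is set, which matters in the next step.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.fromElements("a", "bb", "ccc")
   .keyBy(String::length) // PartitionTransformation -> PartitionTransformationTranslator (virtual node)
   .map(s -> s + "!")     // keyed OneInputTransformation -> OneInputTransformationTranslator
   .print();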
<3> The real worker: AbstractOneInputTransformationTranslator
Its implementation:
abstract class AbstractOneInputTransformationTranslator<IN, OUT, OP extends Transformation<OUT>>
extends SimpleTransformationTranslator<OUT, OP> {
protected Collection<Integer> translateInternal(
final Transformation<OUT> transformation,
final StreamOperatorFactory<OUT> operatorFactory,
final TypeInformation<IN> inputType,
@Nullable final KeySelector<IN, ?> stateKeySelector,
@Nullable final TypeInformation<?> stateKeyType,
final Context context) {
checkNotNull(transformation);
checkNotNull(operatorFactory);
checkNotNull(inputType);
checkNotNull(context);
final StreamGraph streamGraph = context.getStreamGraph();
final String slotSharingGroup = context.getSlotSharingGroup();
final int transformationId = transformation.getId();
final ExecutionConfig executionConfig = streamGraph.getExecutionConfig();
/* call StreamGraph's addOperator to turn the transformation into a StreamNode */
streamGraph.addOperator(
transformationId, // 1. unique node id
slotSharingGroup, // 2. slot sharing group
transformation.getCoLocationGroupKey(), // 3. co-location group key
operatorFactory, // 4. operator factory
inputType, // 5. input data type
transformation.getOutputType(), // 6. output data type
transformation.getName()); // 7. node name
/* this is key to Flink's stateful processing:
 * 1. create a serializer for the key (keySerializer)
 * 2. then configure the state partitioning rule via setOneInputStateKey
 * together this ensures that records with the same key are routed to the same task
 */
if (stateKeySelector != null) {
TypeSerializer<?> keySerializer = stateKeyType.createSerializer(executionConfig);
streamGraph.setOneInputStateKey(transformationId, stateKeySelector, keySerializer);
}
// from here on it's plain configuration: parallelism and max parallelism
int parallelism = transformation.getParallelism() != ExecutionConfig.PARALLELISM_DEFAULT
? transformation.getParallelism()
: executionConfig.getParallelism();
streamGraph.setParallelism(transformationId, parallelism);
streamGraph.setMaxParallelism(transformationId, transformation.getMaxParallelism());
// the list of all upstream nodes of the current transformation
final List<Transformation<?>> parentTransformations = transformation.getInputs();
// checkState asserts that the current node has exactly one upstream node, otherwise it throws
checkState(
parentTransformations.size() == 1,
"Expected exactly one input transformation but found " + parentTransformations.size());
/* add the StreamEdge(s):
 * context.getStreamNodeIds returns the physical node ids of the parent transformation
 * addEdge connects source node id -> target node id; the third argument is the typeNumber,
 * i.e. which input of the downstream operator this edge feeds (0 for a single-input operator)
 */
for (Integer inputId: context.getStreamNodeIds(parentTransformations.get(0))) {
streamGraph.addEdge(inputId, transformationId, 0);
}
return Collections.singleton(transformationId);
}
}
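To see the outcome of translateInternal, we can build a tiny pipeline and inspect the StreamNodes and StreamEdges it produced. A hedged sketch; getStreamGraph, getStreamNodes, and getOutEdges are @Internal-but-public APIs in recent Flink versions:
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.graph.StreamGraph;
import org.apache.flink.streaming.api.graph.StreamNode;
public class InspectStreamGraph {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements(1, 2, 3).filter(n -> n > 1).print();
        // runs StreamGraphGenerator.generate() over the transformations list
        StreamGraph sg = env.getStreamGraph();
        for (StreamNode node : sg.getStreamNodes()) {
            // each StreamNode came from streamGraph.addOperator(...),
            // each out-edge from streamGraph.addEdge(...)
            System.out.println(node.getId() + " [" + node.getOperatorName() + "] out=" + node.getOutEdges());
        }
    }
}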
(2) When no matching translator exists
Only two legacy transformation types end up here:
- FeedbackTransformation: single-stream iteration (e.g. bulk iteration)
- CoFeedbackTransformation: two-stream iteration
These patterns have largely been superseded by KeyedProcessFunction, which goes through translate, so we won't dwell on them; for reference, the legacy API looks like the sketch below.
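A hedged sketch of the legacy single-stream iteration API (DataStream.iterate), which is what produces a FeedbackTransformation:
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.IterativeStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class LegacyIterationDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // iterate() creates the feedback point; closeWith defines the feedback edge
        IterativeStream<Long> iteration = env.fromSequence(1, 10).iterate();
        DataStream<Long> minusOne = iteration.map(v -> v - 1);
        iteration.closeWith(minusOne.filter(v -> v > 0)); // values > 0 are fed back into the loop
        minusOne.filter(v -> v <= 0).print();             // values <= 0 leave the loop
        env.execute("legacy-iteration");
    }
}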
II. Supplementary Notes on DataStream Operations
Common transformations on DataStream include map, flatMap, filter, and so on. These transformations build up a tree of Transformations, which is then converted into the StreamGraph. Taking map as an example, let's trace how List<Transformation<?>> transformations is populated:
public <R> SingleOutputStreamOperator<R> map(MapFunction<T, R> mapper) {
TypeInformation<R> outType = TypeExtractor.getMapReturnTypes(clean(mapper), getType(),
Utils.getCallLocationName(), true);
return map(mapper, outType);
}
public <R> SingleOutputStreamOperator<R> map(MapFunction<T, R> mapper, TypeInformation<R> outputType) {
// returns a new DataStream; StreamMap is an implementation of StreamOperator
return transform("Map", outputType, new StreamMap<>(clean(mapper))); // calls the transform method below
}
public <R> SingleOutputStreamOperator<R> transform(
String operatorName,
TypeInformation<R> outTypeInfo,
OneInputStreamOperator<T, R> operator) {
return doTransform(operatorName, outTypeInfo, SimpleOperatorFactory.of(operator)); // calls doTransform below
}
protected <R> SingleOutputStreamOperator<R> doTransform(
String operatorName,
TypeInformation<R> outTypeInfo,
StreamOperatorFactory<R> operatorFactory) {
// read the output type of the input Transform to coax out errors about MissingTypeInfo
transformation.getOutputType();
// the new transformation is chained to this DataStream's transformation, building up the tree
OneInputTransformation<T, R> resultTransform = new OneInputTransformation<>(
this.transformation,
operatorName,
operatorFactory,
outTypeInfo,
environment.getParallelism());
@SuppressWarnings({"unchecked", "rawtypes"})
SingleOutputStreamOperator<R> returnStream = new SingleOutputStreamOperator(environment, resultTransform);
// every transformation is stored in the env; StreamGraphGenerator's generate method later iterates this list to build the StreamGraph
getExecutionEnvironment().addOperator(resultTransform);
return returnStream;
}
As the code above shows, map wraps the user-defined MapFunction in the StreamMap operator, wraps the StreamMap in a OneInputTransformation, and finally stores that transformation in the env; when env.execute() is called, StreamGraphGenerator's generate method walks the env's transformation collection and constructs the StreamGraph. The layered implementation is shown in the figure below.
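The same layering can also be spelled out by hand. A hedged sketch using the public transform(...) entry point shown above (Types comes from org.apache.flink.api.common.typeinfo; StreamMap is an @Internal class, so this is for illustration only, not a recommended usage):
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.operators.StreamMap;
public class ManualMapDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> words = env.fromElements("flink", "stream", "graph");
        // equivalent to words.map(String::length): user function -> StreamMap operator
        MapFunction<String, Integer> fn = String::length;
        SingleOutputStreamOperator<Integer> lengths =
                words.transform("Map", Types.INT, new StreamMap<>(env.clean(fn)));
        // transform -> doTransform wraps the StreamMap in a OneInputTransformation
        // and registers it via env.addOperator
        lengths.print();
        env.execute("manual-map");
    }
}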
Note also that not every Transformation becomes a physical operation in the runtime layer. Some are purely logical concepts, such as union, split/select, and partition. The transformation tree shown in the figure below is optimized at runtime into the operation graph beneath it.
In the source, assignTimestampsAndWatermarks, doTransform, addSink, and executeAndCollectWithClient all call env's addOperator method.