Flink-Graph-2. StreamGraph Generation Source Code


I. How the Source Code Generates a StreamGraph

0. Conclusions First

  1. User code definition phase: a StreamExecutionEnvironment is created, and calls to operators such as map/filter/process define the logical dataflow. Each operator call is wrapped in a Transformation object and appended to the environment's transformations list.
  2. StreamGraph generation phase: calling env.getStreamGraph() kicks off generation, driven by StreamGraphGenerator:
    • iterate over every Transformation and pick the matching TransformationTranslator by type
    • the translator maps the logical transformation to StreamNodes (physical nodes) and StreamEdges (dataflow edges)
    • the result is a complete StreamGraph containing the nodes, edges, resource settings, and everything else needed for execution
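The two phases above can be modeled with a minimal sketch. All class and method names here are illustrative stand-ins, not Flink's actual API: operators register Transformation objects in a list, and a generator pass turns each list entry into a node plus an edge back to its parent.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal two-phase model: phase 1 collects Transformations, phase 2 walks the
// list and emits one edge per parent link. Names are illustrative only.
public class TwoPhaseSketch {
    static class Transformation {
        final int id;
        final String name;
        final Transformation input; // parent, or null for a source
        Transformation(int id, String name, Transformation input) {
            this.id = id; this.name = name; this.input = input;
        }
    }

    // Phase 2: produce edges "parentId->childId" for every non-source entry.
    public static List<String> generateEdges(List<Transformation> transformations) {
        List<String> edges = new ArrayList<>();
        for (Transformation t : transformations) {
            if (t.input != null) {
                edges.add(t.input.id + "->" + t.id);
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        Transformation source = new Transformation(1, "Source", null);
        Transformation map = new Transformation(2, "Map", source);
        Transformation filter = new Transformation(3, "Filter", map);
        List<Transformation> transformations = new ArrayList<>();
        transformations.add(map);    // phase 1: map/filter get registered
        transformations.add(filter);
        System.out.println(generateEdges(transformations)); // [1->2, 2->3]
    }
}
```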

The StreamGraph generation call chain

  1. StreamExecutionEnvironment.getStreamGraph: calls StreamGraphGenerator.generate
  2. -> StreamGraphGenerator.generate: iterates over the operator nodes and calls transform on each to build the StreamGraph
  3. -> transform: looks up the translator matching the current node's type and calls translate to run it
  4. -> translate: picks the translation strategy based on the execution mode -> translator implementation class .translateForStreaming
    • -> e.g. OneInputTransformationTranslator.translateForStreaming delegates to its parent class AbstractOneInputTransformationTranslator's translateInternal, which does the real conversion: it turns the node into a StreamNode and adds a StreamEdge to each upstream node

Where transformations get added to List<Transformation<?>> transformations

  • map|flatMap|filter|process operators -> transform -> doTransform -> env.addOperator adds the node to transformations
  • assignTimestampsAndWatermarks -> env.addOperator adds the node to transformations
  • addSink -> env.addOperator adds the node to transformations
  • executeAndCollectWithClient -> env.addOperator adds the node to transformations

1. Starting from StreamExecutionEnvironment

To run a Flink job, we first create an env object, then write map, filter, process and other operators, and finally hand everything to env.execute(). So let's start with env.

StreamExecutionEnvironment's getStreamGraph method looks like this:

@Internal
public StreamGraph getStreamGraph(String jobName, boolean clearTransformations) {
        // delegates to generate() to build the StreamGraph
        StreamGraph streamGraph = getStreamGraphGenerator().setJobName(jobName).generate();
        if (clearTransformations) {
                this.transformations.clear();
        }
        return streamGraph;
}

// Adds each operator to the transformations list; its caller is DataStream — see section II (DataStream operations) below for details
public void addOperator(Transformation<?> transformation) {  
    Preconditions.checkNotNull(transformation, "transformation must not be null.");  
    this.transformations.add(transformation); 
}

2. Into StreamGraphGenerator

Next, let's follow getStreamGraphGenerator().setJobName(jobName).generate() into generate(), which lands us in the StreamGraphGenerator class:

public StreamGraph generate() {
        streamGraph = new StreamGraph(executionConfig, checkpointConfig, savepointRestoreSettings);
        shouldExecuteInBatchMode = shouldExecuteInBatchMode(runtimeExecutionMode);
        configureStreamGraph(streamGraph);

        alreadyTransformed = new HashMap<>();

        /* transformations is a list holding the operators from user code, e.g. map, filter, process */
        // Iterate over every operator and run transform on it; a DataStream is backed by a Transformation underneath
        for (Transformation<?> transformation: transformations) {
                // transform turns each transformation into StreamNodes and StreamEdges; think of it as building a tree where nodes are linked by edges, which ultimately forms the StreamGraph
                transform(transformation);
        }

        final StreamGraph builtStreamGraph = streamGraph;

        alreadyTransformed.clear();
        alreadyTransformed = null;
        streamGraph = null;

        return builtStreamGraph;
}

Now let's look at transform():

// Converts a single transformation into StreamNodes and StreamEdges in the StreamGraph
// Returns the set of ids produced by this transform, usually of size 1 (except for FeedbackTransformation)

private Collection<Integer> transform(Transformation<?> transform) {
        // if this transformation was already translated, return the cached result
        if (alreadyTransformed.containsKey(transform)) {
                return alreadyTransformed.get(transform);
        }

        LOG.debug("Transforming " + transform);
        // fill in the operator's max parallelism (details not important here)
        if (transform.getMaxParallelism() <= 0) {

                // if the max parallelism hasn't been set, then first use the job wide max parallelism
                // from the ExecutionConfig.
                int globalMaxParallelismFromConfig = executionConfig.getMaxParallelism();
                if (globalMaxParallelismFromConfig > 0) {
                        transform.setMaxParallelism(globalMaxParallelismFromConfig);
                }
        }

        // read the output type to trigger a MissingTypeInfo exception early; if generics were erased, this throws
        transform.getOutputType();

        // pick the matching translator for this Transformation's type
        /*
        private static final Map<Class<? extends Transformation>, TransformationTranslator<?, ? extends Transformation>> translatorMap;
        static {
                @SuppressWarnings("rawtypes")
                Map<Class<? extends Transformation>, TransformationTranslator<?, ? extends Transformation>> tmp = new HashMap<>();
                tmp.put(OneInputTransformation.class, new OneInputTransformationTranslator<>());
                tmp.put(TwoInputTransformation.class, new TwoInputTransformationTranslator<>());
                tmp.put(MultipleInputTransformation.class, new MultiInputTransformationTranslator<>());
                tmp.put(KeyedMultipleInputTransformation.class, new MultiInputTransformationTranslator<>());
                tmp.put(SourceTransformation.class, new SourceTransformationTranslator<>());
                tmp.put(SinkTransformation.class, new SinkTransformationTranslator<>());
                tmp.put(LegacySinkTransformation.class, new LegacySinkTransformationTranslator<>());
                tmp.put(LegacySourceTransformation.class, new LegacySourceTransformationTranslator<>());
                tmp.put(UnionTransformation.class, new UnionTransformationTranslator<>());
                tmp.put(PartitionTransformation.class, new PartitionTransformationTranslator<>());
                tmp.put(SideOutputTransformation.class, new SideOutputTransformationTranslator<>());
                tmp.put(ReduceTransformation.class, new ReduceTransformationTranslator<>());
                tmp.put(TimestampsAndWatermarksTransformation.class, new TimestampsAndWatermarksTransformationTranslator<>());
                tmp.put(BroadcastStateTransformation.class, new BroadcastStateTransformationTranslator<>());
                translatorMap = Collections.unmodifiableMap(tmp);
        }
        * */
        @SuppressWarnings("unchecked")
        final TransformationTranslator<?, Transformation<?>> translator =
                        (TransformationTranslator<?, Transformation<?>>) translatorMap.get(transform.getClass());

        Collection<Integer> transformedIds;
        // if a translator was found, call translate to run it
        if (translator != null) {
                transformedIds = translate(translator, transform);
        } else { // no matching translator, fall back to legacyTransform
                transformedIds = legacyTransform(transform);
        }

        // cache the result and return it
        if (!alreadyTransformed.containsKey(transform)) {
                alreadyTransformed.put(transform, transformedIds);
        }

        return transformedIds;
}
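The dispatch in transform() boils down to a class-keyed lookup table with a legacy fallback, plus memoization so shared upstream transformations are only translated once. A simplified sketch (names and string keys are illustrative; Flink keys the real translatorMap by Transformation class):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of transform()'s dispatch: look up a translator by transformation type,
// fall back to a legacy path when none is registered, and memoize results.
public class DispatchSketch {
    interface Translator { String translate(String name); }

    static final Map<String, Translator> TRANSLATOR_MAP = new HashMap<>();
    static {
        // keyed by class in Flink; keyed by a type name here to keep the sketch short
        TRANSLATOR_MAP.put("OneInput", n -> "node(" + n + ")");
        TRANSLATOR_MAP.put("Partition", n -> "virtual(" + n + ")");
    }

    static final Map<String, String> alreadyTransformed = new HashMap<>();

    public static String transform(String kind, String name) {
        String key = kind + ":" + name;
        if (alreadyTransformed.containsKey(key)) {
            return alreadyTransformed.get(key); // already translated, reuse result
        }
        Translator t = TRANSLATOR_MAP.get(kind);
        String result = (t != null) ? t.translate(name) : "legacy(" + name + ")";
        alreadyTransformed.put(key, result);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(transform("OneInput", "map"));   // node(map)
        System.out.println(transform("Feedback", "iter"));  // legacy(iter)
    }
}
```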

(1) When a matching translator exists

<1> translate()
private Collection<Integer> translate(
                final TransformationTranslator<?, Transformation<?>> translator,
                final Transformation<?> transform) {
        checkNotNull(translator);
        checkNotNull(transform);
        // get the IDs of all parent nodes of this transformation, i.e. its upstream nodes
        final List<Collection<Integer>> allInputIds = getParentInputIds(transform.getInputs());

        // check again whether this transformation has already been translated
        if (alreadyTransformed.containsKey(transform)) {
                return alreadyTransformed.get(transform);
        }

        // determine the slot sharing group
        /* determineSlotSharingGroup logic:
        * if the transformation has an explicitly configured slot sharing group, use it;
        * otherwise, try to inherit the slot sharing group from the parent nodes
        * */
        final String slotSharingGroup = determineSlotSharingGroup(
                        transform.getSlotSharingGroup(),
                        allInputIds.stream()
                                        .flatMap(Collection::stream)
                                        .collect(Collectors.toList()));
        // create the translation context
        final TransformationTranslator.Context context = new ContextImpl(
                        this, streamGraph, slotSharingGroup, configuration);
        // pick the translation strategy based on the execution mode
        return shouldExecuteInBatchMode
                        ? translator.translateForBatch(transform, context)
                        : translator.translateForStreaming(transform, context);
}
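The slot-sharing decision described in the comment can be sketched as: an explicitly configured group wins; otherwise, if all inputs agree on one group, inherit it; otherwise fall back to "default". This is a simplification of StreamGraphGenerator#determineSlotSharingGroup, with illustrative signatures:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the slot-sharing-group decision: explicit setting wins; otherwise
// inherit the inputs' group if they all agree; otherwise use "default".
public class SlotSharingSketch {
    public static String determineSlotSharingGroup(String specified, List<String> inputGroups) {
        if (specified != null) {
            return specified; // explicitly set on the transformation
        }
        String inherited = null;
        for (String g : inputGroups) {
            if (inherited == null) {
                inherited = g;
            } else if (!inherited.equals(g)) {
                return "default"; // inputs disagree, fall back
            }
        }
        return inherited != null ? inherited : "default";
    }

    public static void main(String[] args) {
        System.out.println(determineSlotSharingGroup(null, Arrays.asList("g1", "g1"))); // g1
        System.out.println(determineSlotSharingGroup("mine", Arrays.asList("g1")));     // mine
    }
}
```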
<2> Then translateForStreaming() is called

TransformationTranslator is an interface whose base implementation is the abstract class SimpleTransformationTranslator. The abstract parent's translateForStreaming delegates to the subclass's translateForStreamingInternal.

@Override
public Collection<Integer> translateForStreaming(final T transformation, final Context context) {
        checkNotNull(transformation);
        checkNotNull(context);

        // dispatches to the subclass, distinguishing one-input operators like map from partitioning operators like keyBy
        final Collection<Integer> transformedIds =
                        translateForStreamingInternal(transformation, context);
        configure(transformation, context);

        return transformedIds;
}
  • nodes without keyBy are handled by OneInputTransformationTranslator
  • nodes with keyBy are handled by PartitionTransformationTranslator
@Internal
public final class OneInputTransformationTranslator<IN, OUT>
		extends AbstractOneInputTransformationTranslator<IN, OUT, OneInputTransformation<IN, OUT>> {

	@Override
	public Collection<Integer> translateForBatchInternal(
			final OneInputTransformation<IN, OUT> transformation,
			final Context context) {
		KeySelector<IN, ?> keySelector = transformation.getStateKeySelector();
		Collection<Integer> ids = translateInternal(transformation,
			transformation.getOperatorFactory(),
			transformation.getInputType(),
			keySelector,
			transformation.getStateKeyType(),
			context
		);
		boolean isKeyed = keySelector != null;
		if (isKeyed) {
			BatchExecutionUtils.applySortingInputs(transformation.getId(), context);
		}

		return ids;
	}

	@Override
	public Collection<Integer> translateForStreamingInternal(
			final OneInputTransformation<IN, OUT> transformation,
			final Context context) {
		// actually delegates to the parent class AbstractOneInputTransformationTranslator's translateInternal
		return translateInternal(transformation,
			transformation.getOperatorFactory(),
			transformation.getInputType(),
			transformation.getStateKeySelector(),
			transformation.getStateKeyType(),
			context
		);
	}
}
<3> The real workhorse: AbstractOneInputTransformationTranslator

Its implementation:

abstract class AbstractOneInputTransformationTranslator<IN, OUT, OP extends Transformation<OUT>>
		extends SimpleTransformationTranslator<OUT, OP> {

	protected Collection<Integer> translateInternal(
			final Transformation<OUT> transformation,
			final StreamOperatorFactory<OUT> operatorFactory,
			final TypeInformation<IN> inputType,
			@Nullable final KeySelector<IN, ?> stateKeySelector,
			@Nullable final TypeInformation<?> stateKeyType,
			final Context context) {
		checkNotNull(transformation);
		checkNotNull(operatorFactory);
		checkNotNull(inputType);
		checkNotNull(context);

		final StreamGraph streamGraph = context.getStreamGraph();
		final String slotSharingGroup = context.getSlotSharingGroup();
		final int transformationId = transformation.getId();
		final ExecutionConfig executionConfig = streamGraph.getExecutionConfig();

		/* call StreamGraph#addOperator to turn the transformation into a StreamNode */
		streamGraph.addOperator(
			transformationId,          // 1. unique node id
			slotSharingGroup,          // 2. slot sharing group
			transformation.getCoLocationGroupKey(), // 3. co-location group key
			operatorFactory,           // 4. operator factory
			inputType,                 // 5. input data type
			transformation.getOutputType(), // 6. output data type
			transformation.getName()); // 7. node name

		/* this is the key to Flink's stateful computation:
		* 1. create the key serializer (keySerializer),
		* 2. then configure state partitioning via setOneInputStateKey
		* which ensures that records with the same key are routed to the same task
		* */
		if (stateKeySelector != null) {
			TypeSerializer<?> keySerializer = stateKeyType.createSerializer(executionConfig);
			streamGraph.setOneInputStateKey(transformationId, stateKeySelector, keySerializer);
		}
		// remaining configuration, details not important here
		int parallelism = transformation.getParallelism() != ExecutionConfig.PARALLELISM_DEFAULT
			? transformation.getParallelism()
			: executionConfig.getParallelism();
		streamGraph.setParallelism(transformationId, parallelism);
		streamGraph.setMaxParallelism(transformationId, transformation.getMaxParallelism());

		// the list of all upstream nodes of this transformation
		final List<Transformation<?>> parentTransformations = transformation.getInputs();
		// checkState asserts that this node has exactly one upstream node, otherwise it throws
		checkState(
			parentTransformations.size() == 1,
			"Expected exactly one input transformation but found " + parentTransformations.size());

		/* add the StreamEdges
		* context.getStreamNodeIds returns the physical node IDs of the parent transformation
		* addEdge adds a dataflow edge; its arguments are the source node id, the target node id, and the type number (0 for a single-input operator)
		* */
		for (Integer inputId: context.getStreamNodeIds(parentTransformations.get(0))) {
			streamGraph.addEdge(inputId, transformationId, 0);
		}

		return Collections.singleton(transformationId);
	}
}
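Stripped of configuration, translateInternal's graph building reduces to: add a node keyed by the transformation id, then add one edge per upstream id. A minimal model (illustrative types, not Flink's StreamGraph):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal model of the graph translateInternal builds: a node table keyed by
// transformation id plus a flat edge list (source -> target).
public class GraphSketch {
    final Map<Integer, String> nodes = new HashMap<>();
    final List<int[]> edges = new ArrayList<>();

    void addOperator(int id, String name) {
        nodes.put(id, name); // one node per physical transformation
    }

    void addEdge(int sourceId, int targetId) {
        edges.add(new int[] {sourceId, targetId});
    }

    public static GraphSketch buildSample() {
        GraphSketch g = new GraphSketch();
        g.addOperator(1, "Source");
        g.addOperator(2, "Map");
        g.addEdge(1, 2); // connect the upstream node to the new node
        return g;
    }

    public static void main(String[] args) {
        GraphSketch g = buildSample();
        System.out.println(g.nodes.size() + " nodes, " + g.edges.size() + " edges");
    }
}
```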

(2) When no matching translator exists

Only two early, legacy transformation types end up here:

  • FeedbackTransformation: single-stream iteration (e.g. bulk iteration)
  • CoFeedbackTransformation: two-stream iteration

These use cases are nowadays typically expressed with KeyedProcessFunction, which goes through translate, so we won't dwell on them here.

II. Supplementary Notes on DataStream Operations

Common transformations on DataStream include map, flatMap, filter, and so on. These transformations build a tree of StreamTransformations, which is then turned into the StreamGraph. Taking map as an example, let's trace how List<Transformation<?>> transformations gets populated:

public <R> SingleOutputStreamOperator<R> map(MapFunction<T, R> mapper) {

	TypeInformation<R> outType = TypeExtractor.getMapReturnTypes(clean(mapper), getType(),
			Utils.getCallLocationName(), true);

	return map(mapper, outType);
}

public <R> SingleOutputStreamOperator<R> map(MapFunction<T, R> mapper, TypeInformation<R> outputType) {
	// returns a new DataStream; StreamMap is an implementation of StreamOperator
	return transform("Map", outputType, new StreamMap<>(clean(mapper))); // calls transform below
}

public <R> SingleOutputStreamOperator<R> transform(
		String operatorName,
		TypeInformation<R> outTypeInfo,
		OneInputStreamOperator<T, R> operator) {

	return doTransform(operatorName, outTypeInfo, SimpleOperatorFactory.of(operator)); // calls doTransform below
}

protected <R> SingleOutputStreamOperator<R> doTransform(
		String operatorName,
		TypeInformation<R> outTypeInfo,
		StreamOperatorFactory<R> operatorFactory) {

	// read the output type of the input Transform to coax out errors about MissingTypeInfo
	transformation.getOutputType();

	// the new transformation is linked to this DataStream's current transformation, building up a tree
	OneInputTransformation<T, R> resultTransform = new OneInputTransformation<>(
			this.transformation,
			operatorName,
			operatorFactory,
			outTypeInfo,
			environment.getParallelism());

	@SuppressWarnings({"unchecked", "rawtypes"})
	SingleOutputStreamOperator<R> returnStream = new SingleOutputStreamOperator(environment, resultTransform);
	// every transformation is stored in the env; StreamGraphGenerator's generate method later iterates the list to build the StreamGraph
	getExecutionEnvironment().addOperator(resultTransform);

	return returnStream;
}
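The doTransform chain above can be sketched as follows: each new transformation keeps a reference to the one it was called on, so successive map/filter calls build a linked tree that the generator later walks. Names here are illustrative, not Flink's classes:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of how chained operator calls build a transformation tree: each call
// wraps the previous transformation as its input, like doTransform does.
public class ChainSketch {
    static class Transformation {
        final String name;
        final Transformation input;
        Transformation(String name, Transformation input) {
            this.name = name; this.input = input;
        }
    }

    // Walk from the last transformation back to the source along input links.
    public static List<String> lineage(Transformation last) {
        List<String> names = new ArrayList<>();
        for (Transformation t = last; t != null; t = t.input) {
            names.add(0, t.name); // prepend so the source comes first
        }
        return names;
    }

    public static void main(String[] args) {
        Transformation source = new Transformation("Source", null);
        Transformation map = new Transformation("Map", source);
        Transformation filter = new Transformation("Filter", map);
        System.out.println(lineage(filter)); // [Source, Map, Filter]
    }
}
```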

As the code above shows, map wraps the user-defined MapFunction in the StreamMap operator, wraps StreamMap in a OneInputTransformation, and finally stores that transformation in the env. When env.execute is called, StreamGraphGenerator's generate method walks the transformation collection to build the StreamGraph (see the layered-implementation figure). Note also that not every StreamTransformation becomes a physical operation at the runtime layer. Some are purely logical concepts, such as union, split/select, and partition; at runtime such a transformation tree is optimized into a more compact operator graph.
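The "purely logical" point can be illustrated with a hedged sketch: a partition-style step need not create a node of its own; it can record its partitioner and hand its upstream id through, so downstream edges attach straight to the upstream physical node. This is a simplification of how Flink handles virtual nodes, with illustrative names:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a "virtual" transformation: instead of adding a physical node,
// a partition step records its partitioner against the upstream id and
// returns that id, so only physical operators appear as nodes.
public class VirtualNodeSketch {
    final Map<Integer, String> physicalNodes = new HashMap<>();
    final Map<Integer, String> partitioners = new HashMap<>();

    int addPhysical(int id, String name) {
        physicalNodes.put(id, name);
        return id;
    }

    // Returns the upstream id unchanged: no new physical node is created.
    int addPartition(int upstreamId, String partitioner) {
        partitioners.put(upstreamId, partitioner);
        return upstreamId;
    }

    public static void main(String[] args) {
        VirtualNodeSketch g = new VirtualNodeSketch();
        int src = g.addPhysical(1, "Source");
        int part = g.addPartition(src, "hash"); // still id 1, no extra node
        g.addPhysical(2, "Map");
        // an edge would then connect part (== 1) to node 2, carrying "hash"
        System.out.println(g.physicalNodes.size() + " physical nodes"); // 2 physical nodes
    }
}
```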


In the source, assignTimestampsAndWatermarks, doTransform, addSink, and executeAndCollectWithClient all call env's addOperator method.