Flink源码解析 | 从Example出发：理解StreamGraph的生成过程

微信公众号:深广大数据Club关注可了解更多大数据相关资讯。问题或建议，请公众号留言;如果你觉得深广大数据Club对你有帮助，欢迎转发朋友圈

本文主要讲述Apache Flink流处理过程中，StreamGraph的生成过程。

以上这张图片为生成StreamGraph的代码调用流程图，接下来我们还是从example入手来学习理解StreamGraph的生成过程吧。

启动环境

这里我们同样从SocketWindowWordCount入手。我们这里先将Apache Flink启动起来，然后将example部署到Flink上运行。

至于如何启动的过程我们这里就不做赘述，详细的可以看本地部署的文章。《Apache Flink本地安装及Example运行》

从前面几篇文章我们可以了解到。Apache Flink要运行首先需要创建对应的Environment，stream对应的StreamExecutionEnvironment，batch的对应的ExecutionEnvironment。

SocketWindowWordCount

由于我们这里运行的是local模式，所以SocketWindowWordCount的Environment为LocalStreamEnvironment

// get the execution environmentfinal StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();...env.execute("Socket Window WordCount");

通过LocalStreamEnvironment.execute()方法来执行任务。

public JobExecutionResult execute(String jobName) throws Exception {    // transform the streaming program into a JobGraph    StreamGraph streamGraph = getStreamGraph();    streamGraph.setJobName(jobName);    ...}

这里我们关注的是StreamGraph，所以execute()方法中后续的内容这里就不讲，在后续的文章中会做进一步讲解。

execute方法调用getStreamGraph()获取StreamGraph实例。

public StreamGraph getStreamGraph() {    if (transformations.size() <= 0) {        throw new IllegalStateException("No operators defined in streaming topology. Cannot execute.");    }    return StreamGraphGenerator.generate(this, transformations);}

getStreamGraph()方法调用StreamGraphGenerator.generate()生成StreamGraph。从这里开始就是我们重点要讲的内容。

StreamGraphGenerator.generate()包含两个参数。

this //指定Environment为当前的LocalStreamEnvironment
transformations

检查LocalStreamEnvironment的父类StreamExecutionEnvironment可以发现，在StreamExecutionEnvironment变量中包含transformations，是一个List类型。

protected final List<StreamTransformation<?>> transformations = new ArrayList<>();

那这个List在哪里被填充？

Transformations列表填充

我们回到SocketWindowWordCount，从这里入手来理解这个List是怎么被填充的。

DataStream<WordWithCount> windowCounts = text    .flatMap(new FlatMapFunction<String, WordWithCount>() {        @Override        public void flatMap(String value, Collector<WordWithCount> out) {            for (String word : value.split("\\s")) {                out.collect(new WordWithCount(word, 1L));            }        }    })    .keyBy("word")    .timeWindow(Time.seconds(5))    .reduce(new ReduceFunction<WordWithCount>() {        @Override        public WordWithCount reduce(WordWithCount a, WordWithCount b) {            eturn new WordWithCount(a.word, a.count + b.count);        }    });

先看text.flatMap()

public <R> SingleOutputStreamOperator<R> flatMap(FlatMapFunction<T, R> flatMapper) {    TypeInformation<R> outType = TypeExtractor.getFlatMapReturnTypes(clean(flatMapper),        getType(), Utils.getCallLocationName(), true);    return transform("Flat Map", outType, new StreamFlatMap<>(clean(flatMapper)));}

flatMap()方法调用transform()方法，传入operatorName="Flat Map"

public <R> SingleOutputStreamOperator<R> transform(String operatorName, TypeInformation<R> outTypeInfo, OneInputStreamOperator<T, R> operator) {        // read the output type of the input Transform to coax out errors about MissingTypeInfo        transformation.getOutputType();        OneInputTransformation<T, R> resultTransform = new OneInputTransformation<>(                this.transformation,                operatorName,                operator,                outTypeInfo,                environment.getParallelism());        @SuppressWarnings({ "unchecked", "rawtypes" })        SingleOutputStreamOperator<R> returnStream = new SingleOutputStreamOperator(environment, resultTransform);        getExecutionEnvironment().addOperator(resultTransform);        return returnStream;    }

获取resultTransform对象
通过environment、resultTransform生成returnStream
getExecutionEnvironment()获取Environment，此处获取的是LocalStreamEnvironment
返回

重点在第三步：getExecutionEnvironment().addOperator(resultTransform)

@Internalpublic void addOperator(StreamTransformation<?> transformation) {    Preconditions.checkNotNull(transformation, "transformation must not be null.");    this.transformations.add(transformation);}

transformations列表的填充就是通过Environment的addOperator方法进行数据填充。在execute方法执行之前，会将用户编写的DataStream操作代码转换为一个transformations列表。在execute执行的时候传入transformations生成SteamGraph。

StreamGraphGenerator

接下来我们继续看StreamGraphGenerator.generate()的内容

private StreamGraph generateInternal(List<StreamTransformation<?>> transformations) {    for (StreamTransformation<?> transformation: transformations) {        transform(transformation);    }    return streamGraph;}

方法主体是一个for循环，循环遍历transformations列表，并将单项transformation传入transform方法。

private Collection<Integer> transform(StreamTransformation<?> transform) {        if (alreadyTransformed.containsKey(transform)) {            return alreadyTransformed.get(transform);        }        LOG.debug("Transforming " + transform);        if (transform.getMaxParallelism() <= 0) {            // if the max parallelism hasn't been set, then first use the job wide max parallelism            // from theExecutionConfig.            int globalMaxParallelismFromConfig = env.getConfig().getMaxParallelism();            if (globalMaxParallelismFromConfig > 0) {                transform.setMaxParallelism(globalMaxParallelismFromConfig);            }        }        // call at least once to trigger exceptions about MissingTypeInfo        transform.getOutputType();        Collection<Integer> transformedIds;        if (transform instanceof OneInputTransformation<?, ?>) {            transformedIds = transformOneInputTransform((OneInputTransformation<?, ?>) transform);        } else if (transform instanceof TwoInputTransformation<?, ?, ?>) {            transformedIds = transformTwoInputTransform((TwoInputTransformation<?, ?, ?>) transform);        } else if (transform instanceof SourceTransformation<?>) {            transformedIds = transformSource((SourceTransformation<?>) transform);        } else if (transform instanceof SinkTransformation<?>) {            transformedIds = transformSink((SinkTransformation<?>) transform);        } else if (transform instanceof UnionTransformation<?>) {            transformedIds = transformUnion((UnionTransformation<?>) transform);        } else if (transform instanceof SplitTransformation<?>) {            transformedIds = transformSplit((SplitTransformation<?>) transform);        } else if (transform instanceof SelectTransformation<?>) {            transformedIds = transformSelect((SelectTransformation<?>) transform);        } else if (transform instanceof FeedbackTransformation<?>) {            transformedIds = transformFeedback((FeedbackTransformation<?>) transform);        } else if (transform instanceof CoFeedbackTransformation<?>) {            transformedIds = transformCoFeedback((CoFeedbackTransformation<?>) transform);        } else if (transform instanceof PartitionTransformation<?>) {            transformedIds = transformPartition((PartitionTransformation<?>) transform);        } else if (transform instanceof SideOutputTransformation<?>) {            transformedIds = transformSideOutput((SideOutputTransformation<?>) transform);        } else {            throw new IllegalStateException("Unknown transformation: " + transform);        }        // need this check because the iterate transformation adds itself before        // transforming the feedback edges        if (!alreadyTransformed.containsKey(transform)) {            alreadyTransformed.put(transform, transformedIds);        }        if (transform.getBufferTimeout() >= 0) {            streamGraph.setBufferTimeout(transform.getId(), transform.getBufferTimeout());        }        if (transform.getUid() != null) {            streamGraph.setTransformationUID(transform.getId(), transform.getUid());        }        if (transform.getUserProvidedNodeHash() != null) {            streamGraph.setTransformationUserHash(transform.getId(), transform.getUserProvidedNodeHash());        }        if (transform.getMinResources() != null && transform.getPreferredResources() != null) {            streamGraph.setResources(transform.getId(), transform.getMinResources(), transform.getPreferredResources());        }        return transformedIds;    }

transform()根据传入的transform的类型不同调用不同的transformXXX方法并返回transformedIds，并添加到transformedIds列表中，最终返回给上一层。

《Flink源码解析 | 从Example出发：读懂本地任务执行流程》

《Flink源码解析 | 从Example出发：读懂集群任务执行流程》

《Flink源码解析 | 从Example出发：读懂Flink On Yarn任务执行流程》

《Flink源码解析 | 从Example出发：读懂start-start-shell.sh任务执行流程》