Getting Started with Apache SeaTunnel on Flink

I have recently been evaluating the SeaTunnel framework and already studied how it runs on Spark, which turned out to be very convenient. Since some projects may only use Flink, this article looks at how SeaTunnel runs on top of Flink.

Running the Demo

Here we run the demo shipped with the source code. Its configuration is shown below: a FakeSource that generates fake rows with a name (string) / age (int) schema, an empty transform block, and a Console sink.

env {
  # You can set flink configuration here
  execution.parallelism = 2
  job.mode = "STREAMING"
  #execution.checkpoint.interval = 10000
  #execution.checkpoint.data-uri = "hdfs://localhost:9000/checkpoint"
}

source {
  # This is a example source plugin **only for test and demonstrate the feature source plugin**
  FakeSource {
    parallelism = 2
    result_table_name = "fake"
    row.num = 16
    schema = {
      fields {
        name = "string"
        age = "int"
      }
    }
  }

  # If you would like to get more information about how to configure seatunnel and see full list of source plugins,
  # please go to https://seatunnel.apache.org/docs/category/source-v2
}

transform {

  # If you would like to get more information about how to configure seatunnel and see full list of transform plugins,
  # please go to https://seatunnel.apache.org/docs/category/transform
}

sink {
  Console {
    parallelism = 3
  }

  # If you would like to get more information about how to configure seatunnel and see full list of sink plugins,
  # please go to https://seatunnel.apache.org/docs/category/sink-v2
}

Then simply run its main function:

public class SeaTunnelApiExample {

    public static void main(String[] args)
            throws FileNotFoundException, URISyntaxException, CommandException {
        String configurePath = args.length > 0 ? args[0] : "/examples/fake_to_console.conf";
        String configFile = getTestConfigFile(configurePath);
        FlinkCommandArgs flinkCommandArgs = new FlinkCommandArgs();
        flinkCommandArgs.setConfigFile(configFile);
        flinkCommandArgs.setCheckConfig(false);
        flinkCommandArgs.setVariables(null);
        SeaTunnel.run(flinkCommandArgs.buildCommand());
    }

    public static String getTestConfigFile(String configFile)
            throws FileNotFoundException, URISyntaxException {
        URL resource = SeaTunnelApiExample.class.getResource(configFile);
        if (resource == null) {
            throw new FileNotFoundException("Can't find config file: " + configFile);
        }
        return Paths.get(resource.toURI()).toString();
    }
}

A Brief Analysis of the Source Implementation

Here we take the FakeSource from the demo as an example to analyze how it runs on Flink.

Initialization

FlinkTaskExecuteCommand

This is where FlinkExecution is initialized:

FlinkExecution seaTunnelTaskExecution = new FlinkExecution(config);
try {
	seaTunnelTaskExecution.execute();
} catch (Exception e) {
	throw new CommandExecuteException("Flink job executed failed", e);
}

The constructor of FlinkExecution does quite a lot of work:

public FlinkExecution(Config config) {
	try {
		jarPaths =
				new ArrayList<>(
						Collections.singletonList(
								new File(
												Common.appStarterDir()
														.resolve(FlinkStarter.APP_JAR_NAME)
														.toString())
										.toURI()
										.toURL()));
	} catch (MalformedURLException e) {
		throw new SeaTunnelException("load flink starter error.", e);
	}
	registerPlugin(config.getConfig("env"));
	JobContext jobContext = new JobContext();
	jobContext.setJobMode(RuntimeEnvironment.getJobMode(config));

	this.sourcePluginExecuteProcessor =
			new SourceExecuteProcessor(
					jarPaths, config.getConfigList(Constants.SOURCE), jobContext);
	this.transformPluginExecuteProcessor =
			new TransformExecuteProcessor(
					jarPaths,
					TypesafeConfigUtils.getConfigList(
							config, Constants.TRANSFORM, Collections.emptyList()),
					jobContext);
	this.sinkPluginExecuteProcessor =
			new SinkExecuteProcessor(
					jarPaths, config.getConfigList(Constants.SINK), jobContext);

	this.flinkRuntimeEnvironment =
			FlinkRuntimeEnvironment.getInstance(this.registerPlugin(config, jarPaths));

	this.sourcePluginExecuteProcessor.setRuntimeEnvironment(flinkRuntimeEnvironment);
	this.transformPluginExecuteProcessor.setRuntimeEnvironment(flinkRuntimeEnvironment);
	this.sinkPluginExecuteProcessor.setRuntimeEnvironment(flinkRuntimeEnvironment);
}

The code above initializes the execute processors for the source, transform, and sink stages, and also initializes flinkRuntimeEnvironment, which wraps the Flink execution environment.

private void createStreamEnvironment() {
	Configuration configuration = new Configuration();
	EnvironmentUtil.initConfiguration(config, configuration);
	environment = StreamExecutionEnvironment.getExecutionEnvironment(configuration);
	setTimeCharacteristic();

	setCheckpoint();

	EnvironmentUtil.setRestartStrategy(config, environment.getConfig());

	if (config.hasPath(ConfigKeyName.BUFFER_TIMEOUT_MILLIS)) {
		long timeout = config.getLong(ConfigKeyName.BUFFER_TIMEOUT_MILLIS);
		environment.setBufferTimeout(timeout);
	}

	if (config.hasPath(ConfigKeyName.PARALLELISM)) {
		int parallelism = config.getInt(ConfigKeyName.PARALLELISM);
		environment.setParallelism(parallelism);
	}

	if (config.hasPath(ConfigKeyName.MAX_PARALLELISM)) {
		int max = config.getInt(ConfigKeyName.MAX_PARALLELISM);
		environment.setMaxParallelism(max);
	}

	if (this.jobMode.equals(JobMode.BATCH)) {
		environment.setRuntimeMode(RuntimeExecutionMode.BATCH);
	}
}

public static void initConfiguration(Config config, Configuration configuration) {
	if (config.hasPath("pipeline")) {
		Config pipeline = config.getConfig("pipeline");
		if (pipeline.hasPath("jars")) {
			configuration.setString(PipelineOptions.JARS.key(), pipeline.getString("jars"));
		}
		if (pipeline.hasPath("classpaths")) {
			configuration.setString(
					PipelineOptions.CLASSPATHS.key(), pipeline.getString("classpaths"));
		}
	}
}

The code above initializes the Flink execution environment and puts the jars that the plugins depend on into the pipeline.jars and pipeline.classpaths options of the Configuration.
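As a side note, here is a minimal standalone sketch (not SeaTunnel code; the jar path below is made up) of what this amounts to on the Flink side: whatever is placed under pipeline.jars / pipeline.classpaths in the Configuration used to build the StreamExecutionEnvironment is what Flink ships along with the submitted job.

import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.PipelineOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PipelineJarsSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // hypothetical connector jar path, purely for illustration
        conf.setString(PipelineOptions.JARS.key(),
                "file:///opt/seatunnel/connectors/connector-fake.jar");
        // an environment built from this Configuration ships the listed jars with the job
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment(conf);
        System.out.println("default parallelism: " + env.getParallelism());
    }
}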

Execution

Next comes the execute method:

@Override
public void execute() throws TaskExecuteException {
	List<DataStream<Row>> dataStreams = new ArrayList<>();
	dataStreams = sourcePluginExecuteProcessor.execute(dataStreams);
	dataStreams = transformPluginExecuteProcessor.execute(dataStreams);
	sinkPluginExecuteProcessor.execute(dataStreams);
	log.info(
			"Flink Execution Plan: {}",
			flinkRuntimeEnvironment.getStreamExecutionEnvironment().getExecutionPlan());
	log.info("Flink job name: {}", flinkRuntimeEnvironment.getJobName());
	try {
		flinkRuntimeEnvironment
				.getStreamExecutionEnvironment()
				.execute(flinkRuntimeEnvironment.getJobName());
	} catch (Exception e) {
		throw new TaskExecuteException("Execute Flink job error", e);
	}
}

Ultimately this just submits the Flink job. The rest of this section looks at how the code turns a SeaTunnel source or sink into a Flink source or sink.
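As a point of reference, here is a stripped-down, self-contained sketch in plain Flink (no SeaTunnel classes; the job name and values are made up) showing the same source -> transform -> sink -> execute shape that the execute() method above assembles before submitting the job:

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PlainFlinkPipelineSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // "source" stage
        DataStream<Long> source = env.fromSequence(1, 16);
        // "transform" stage
        DataStream<String> transformed =
                source.map(n -> "row-" + n).returns(Types.STRING);
        // "sink" stage
        transformed.print();
        // the whole job graph is submitted here, like flinkRuntimeEnvironment's execute above
        env.execute("plain-flink-pipeline");
    }
}

The only real difference is that SeaTunnel builds each of those stages from plugin configuration instead of hard-coding them.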

The Flink Source Implementation

Let's step into the execute method of SourceExecuteProcessor:

public List<DataStream<Row>> execute(List<DataStream<Row>> upstreamDataStreams) {
	StreamExecutionEnvironment executionEnvironment =
			flinkRuntimeEnvironment.getStreamExecutionEnvironment();
	List<DataStream<Row>> sources = new ArrayList<>();
	for (int i = 0; i < plugins.size(); i++) {
		SeaTunnelSource internalSource = plugins.get(i);
		BaseSeaTunnelSourceFunction sourceFunction;
		if (internalSource instanceof SupportCoordinate) {
			sourceFunction = new SeaTunnelCoordinatedSource(internalSource);
		} else {
			sourceFunction = new SeaTunnelParallelSource(internalSource);
		}
		DataStreamSource<Row> sourceStream =
				addSource(
						executionEnvironment,
						sourceFunction,
						"SeaTunnel " + internalSource.getClass().getSimpleName(),
						internalSource.getBoundedness()
								== org.apache.seatunnel.api.source.Boundedness.BOUNDED);
		Config pluginConfig = pluginConfigs.get(i);
		if (pluginConfig.hasPath(CommonOptions.PARALLELISM.key())) {
			int parallelism = pluginConfig.getInt(CommonOptions.PARALLELISM.key());
			sourceStream.setParallelism(parallelism);
		}
		registerResultTable(pluginConfig, sourceStream);
		sources.add(sourceStream);
	}
	return sources;
}

The code above first obtains the concrete SeaTunnelSource for each plugin configured in the config file, then wraps it in a SeaTunnelParallelSource (or a SeaTunnelCoordinatedSource if the source needs coordination), and finally turns it into Flink's own DataStreamSource through the addSource method.
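The addSource call used above is a small private helper that is not shown here. The sketch below is a simplified paraphrase of what such a helper has to do (register the SourceFunction on the environment under a readable name and carry the boundedness flag through), not the verbatim SeaTunnel implementation:

import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.types.Row;

// Simplified paraphrase, not the verbatim SeaTunnel helper.
final class AddSourceSketch {
    static DataStreamSource<Row> addSource(
            StreamExecutionEnvironment env,
            SourceFunction<Row> function,
            String name,
            boolean bounded) {
        // register the SourceFunction on the environment under a readable name;
        // the returned DataStreamSource is what setParallelism(...) is applied to later
        DataStreamSource<Row> stream = env.addSource(function, name);
        // 'bounded' is carried through so that BOUNDED SeaTunnel sources can
        // terminate like batch jobs (how the real helper marks this is omitted here)
        return stream;
    }
}

The important point is simply that by this step the SeaTunnel source has become an ordinary Flink DataStreamSource, so everything downstream is plain DataStream API.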

SeaTunnelParallelSource is a fairly central class, so let's look at its implementation:

/** The parallel source function implementation of {@link BaseSeaTunnelSourceFunction} */
public class SeaTunnelParallelSource extends BaseSeaTunnelSourceFunction
        implements ParallelSourceFunction<Row> {

    protected static final String PARALLEL_SOURCE_STATE_NAME = "parallel-source-states";

    public SeaTunnelParallelSource(SeaTunnelSource<SeaTunnelRow, ?, ?> source) {
        // TODO: Make sure the source is uncoordinated.
        super(source);
    }

    @Override
    protected BaseSourceFunction<SeaTunnelRow> createInternalSource() {
        return new ParallelSource<>(
                source,
                restoredState,
                getRuntimeContext().getNumberOfParallelSubtasks(),
                getRuntimeContext().getIndexOfThisSubtask());
    }

    @Override
    protected String getStateName() {
        return PARALLEL_SOURCE_STATE_NAME;
    }
}

As you can see, it extends BaseSeaTunnelSourceFunction and implements the ParallelSourceFunction interface, which means SeaTunnelParallelSource is a parallelizable source. Since BaseSeaTunnelSourceFunction itself extends RichSourceFunction, SeaTunnelParallelSource effectively behaves like a RichParallelSourceFunction: it can access the runtime context of the Flink job and has lifecycle methods such as open.
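To make that "rich + parallel" combination concrete, here is a self-contained toy in plain Flink (not SeaTunnel code; the class name and emitted values are made up): a RichParallelSourceFunction gets an open lifecycle hook and a runtime context from which the parallelism and subtask index can be read, which is exactly what SeaTunnelParallelSource relies on.

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;

// Toy example, not SeaTunnel code: each parallel instance learns its own
// subtask index and the total parallelism from the runtime context in open(),
// then emits its own stride of numbers in run().
public class ToyParallelSource extends RichParallelSourceFunction<Long> {

    private volatile boolean running = true;

    @Override
    public void open(Configuration parameters) throws Exception {
        int parallelism = getRuntimeContext().getNumberOfParallelSubtasks();
        int subtaskId = getRuntimeContext().getIndexOfThisSubtask();
        System.out.printf("subtask %d of %d opened%n", subtaskId, parallelism);
    }

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        long i = getRuntimeContext().getIndexOfThisSubtask();
        int stride = getRuntimeContext().getNumberOfParallelSubtasks();
        while (running && i < 100) {
            // emit under the checkpoint lock, as the SourceFunction contract recommends
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(i);
            }
            i += stride;
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}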

Let's focus on a few core methods of BaseSeaTunnelSourceFunction:

@Override
public void open(Configuration parameters) throws Exception {
	super.open(parameters);
	this.internalSource = createInternalSource();
	this.internalSource.open();
}

The open method is executed once per task instance. Here it creates the ParallelSource and calls its open method.

@Override
protected BaseSourceFunction<SeaTunnelRow> createInternalSource() {
	return new ParallelSource<>(
			source,
			restoredState,
			getRuntimeContext().getNumberOfParallelSubtasks(),
			getRuntimeContext().getIndexOfThisSubtask());
}

As you can see, the ParallelSource constructor is given the parallelism of the current task and the index of the current subtask, along with the concrete source implementation.

public ParallelSource(
		SeaTunnelSource<T, SplitT, StateT> source,
		Map<Integer, List<byte[]>> restoredState,
		int parallelism,
		int subtaskId) {
	this.source = source;
	this.subtaskId = subtaskId;
	this.parallelism = parallelism;

	this.splitSerializer = source.getSplitSerializer();
	this.enumeratorStateSerializer = source.getEnumeratorStateSerializer();
	this.parallelEnumeratorContext =
			new ParallelEnumeratorContext<>(this, parallelism, subtaskId);
	this.readerContext = new ParallelReaderContext(this, source.getBoundedness(), subtaskId);

	// Create or restore split enumerator & reader
	try {
		if (restoredState != null && restoredState.size() > 0) {
			StateT restoredEnumeratorState = null;
			if (restoredState.containsKey(-1)) {
				restoredEnumeratorState =
						enumeratorStateSerializer.deserialize(restoredState.get(-1).get(0));
			}
			restoredSplitState = new ArrayList<>(restoredState.get(subtaskId).size());
			for (byte[] splitBytes : restoredState.get(subtaskId)) {
				restoredSplitState.add(splitSerializer.deserialize(splitBytes));
			}

			splitEnumerator =
					source.restoreEnumerator(
							parallelEnumeratorContext, restoredEnumeratorState);
		} else {
			restoredSplitState = Collections.emptyList();
			splitEnumerator = source.createEnumerator(parallelEnumeratorContext);
		}
		reader = source.createReader(readerContext);
	} catch (Exception e) {
		throw new RuntimeException(e);
	}
}

Let's follow the restoredState == null branch: it first creates a splitEnumerator.

Here we pick FakeSource to look at the concrete implementation. Each source has its own split strategy; FakeSource returns a FakeSourceSplitEnumerator:

@Override
public SourceSplitEnumerator<FakeSourceSplit, FakeSourceState> createEnumerator(
		SourceSplitEnumerator.Context<FakeSourceSplit> enumeratorContext) throws Exception {
	return new FakeSourceSplitEnumerator(enumeratorContext, fakeConfig, Collections.emptySet());
}

After the split enumerator has been created, the source is then used to create its concrete reader, which here is FakeSourceReader:

public SourceReader<SeaTunnelRow, FakeSourceSplit> createReader(
		SourceReader.Context readerContext) throws Exception {
	return new FakeSourceReader(readerContext, rowType, fakeConfig);
}

Both enumeratorContext and readerContext carry the subtask index, which the split-assignment logic relies on later.

At this point the ParallelSource constructor has finished. Next its open method is called:

@Override
public void open() throws Exception {
	executorService =
			ThreadPoolExecutorFactory.createScheduledThreadPoolExecutor(
					1, String.format("parallel-split-enumerator-executor-%s", subtaskId));
	splitEnumerator.open();
	if (restoredSplitState.size() > 0) {
		splitEnumerator.addSplitsBack(restoredSplitState, subtaskId);
	}
	reader.open();
	parallelEnumeratorContext.register();
	splitEnumerator.registerReader(subtaskId);
}

For FakeSource, the open calls above are essentially no-ops; the main thing that happens here is the creation of a thread pool.

That completes the open method of BaseSeaTunnelSourceFunction. Now let's look at its run method:

@Override
public void run(SourceFunction.SourceContext<Row> sourceContext) throws Exception {
	internalSource.run(
			new RowCollector(
					sourceContext,
					sourceContext.getCheckpointLock(),
					source.getProducedType()));
	// Wait for a checkpoint to complete:
	// In the current version(version < 1.14.0), when the operator state of the source changes
	// to FINISHED, jobs cannot be checkpoint executed.
	final long prevCheckpointId = latestTriggerCheckpointId.get();
	// Ensured Checkpoint enabled
	if (getRuntimeContext() instanceof StreamingRuntimeContext
			&& ((StreamingRuntimeContext) getRuntimeContext()).isCheckpointingEnabled()) {
		while (running && prevCheckpointId >= latestCompletedCheckpointId.get()) {
			Thread.sleep(100);
		}
	}
}

The run method triggers the reader to fetch data. It calls the run method of ParallelSource:

@Override
public void run(Collector<T> collector) throws Exception {
	Future<?> future =
			executorService.submit(
					() -> {
						try {
							splitEnumerator.run();
						} catch (Exception e) {
							throw new RuntimeException("SourceSplitEnumerator run failed.", e);
						}
					});

	while (running) {
		if (future.isDone()) {
			future.get();
		}
		reader.pollNext(collector);
		Thread.sleep(SLEEP_TIME_INTERVAL);
	}
	LOG.debug("Parallel source runs complete.");
}

Here the splitEnumerator's run method is submitted to the thread pool, so it is executed exactly once per task. The main loop then repeatedly calls reader.pollNext() to fetch data; the future.isDone() / future.get() check merely surfaces any exception thrown by the enumerator.

First, let's see what the run method of FakeSourceSplitEnumerator does:

@Override
public void run() throws Exception {
	discoverySplits();
	assignPendingSplits();
}

private void discoverySplits() {
	Set<FakeSourceSplit> allSplit = new HashSet<>();
	log.info("Starting to calculate splits.");
	int numReaders = enumeratorContext.currentParallelism();
	int readerRowNum = fakeConfig.getRowNum();
	int splitNum = fakeConfig.getSplitNum();
	int splitRowNum = (int) Math.ceil((double) readerRowNum / splitNum);
	for (int i = 0; i < numReaders; i++) {
		int index = i;
		for (int num = 0; num < readerRowNum; index += numReaders, num += splitRowNum) {
			allSplit.add(new FakeSourceSplit(index, Math.min(splitRowNum, readerRowNum - num)));
		}
	}

	assignedSplits.forEach(allSplit::remove);
	addSplitChangeToPendingAssignments(allSplit);
	log.info("Assigned {} to {} readers.", allSplit, numReaders);
	log.info("Calculated splits successfully, the size of splits is {}.", allSplit.size());
}

The job of discoverySplits is to compute all of the splits for the current source. Each source implements this differently; for example, a table could be split by primary-key ID into 10 SQL queries, producing 10 source splits right here. A sketch of that idea follows.
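As a purely illustrative, framework-free sketch of that primary-key idea (all class and method names below are made up, not real connector code), computing splits from a key range could look like this:

import java.util.ArrayList;
import java.util.List;

// Made-up names, framework-free: one possible way to turn a primary-key range
// into a fixed number of splits, the way a JDBC-style discoverySplits() might.
public class KeyRangeSplitter {

    /** One split covering the primary-key interval [lowerBound, upperBound). */
    static final class KeyRangeSplit {
        final long lowerBound;
        final long upperBound;

        KeyRangeSplit(long lowerBound, long upperBound) {
            this.lowerBound = lowerBound;
            this.upperBound = upperBound;
        }

        @Override
        public String toString() {
            return "KeyRangeSplit[" + lowerBound + ", " + upperBound + ")";
        }
    }

    /** Split the key range [minId, maxId] into at most splitNum roughly equal ranges. */
    static List<KeyRangeSplit> discoverySplits(long minId, long maxId, int splitNum) {
        long step = Math.max(1, (maxId - minId + 1 + splitNum - 1) / splitNum); // ceiling division
        List<KeyRangeSplit> splits = new ArrayList<>(splitNum);
        for (long lower = minId; lower <= maxId; lower += step) {
            splits.add(new KeyRangeSplit(lower, Math.min(lower + step, maxId + 1)));
        }
        return splits;
    }

    public static void main(String[] args) {
        // e.g. primary keys 1..1000 split into 10 ranges of 100 ids each,
        // each of which would become one source split / one SQL query
        discoverySplits(1, 1000, 10).forEach(System.out::println);
    }
}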

The enumeratorContext here is the one initialized earlier, which carries the parallelism and the subtaskId:

this.parallelEnumeratorContext =
                new ParallelEnumeratorContext<>(this, parallelism, subtaskId);

private final Map<Integer, Set<FakeSourceSplit>> pendingSplits;

private void addSplitChangeToPendingAssignments(Collection<FakeSourceSplit> newSplits) {
	for (FakeSourceSplit split : newSplits) {
		int ownerReader = split.getSplitId() % enumeratorContext.currentParallelism();
		pendingSplits.computeIfAbsent(ownerReader, r -> new HashSet<>()).add(split);
	}
}

Finally, addSplitChangeToPendingAssignments stores the splits in a Map keyed by splitId % parallelism, which is how the splits get divided among the different tasks.

With a parallelism of 2, row.num = 16, and split.num left at its default, two FakeSourceSplits of 16 rows each are constructed here, and the following log is printed:

Assigned [FakeSourceSplit(splitId=1, rowNum=16), FakeSourceSplit(splitId=0, rowNum=16)] to 2 readers.

Next, the assignPendingSplits method is called:

private void assignPendingSplits() {
	// Check if there's any pending splits for given readers
	for (int pendingReader : enumeratorContext.registeredReaders()) {
		// Remove pending assignment for the reader
		final Set<FakeSourceSplit> pendingAssignmentForReader =
				pendingSplits.remove(pendingReader);

		if (pendingAssignmentForReader != null && !pendingAssignmentForReader.isEmpty()) {
			// Mark pending splits as already assigned
			synchronized (lock) {
				assignedSplits.addAll(pendingAssignmentForReader);
				// Assign pending splits to reader
				log.info(
						"Assigning splits to readers {} {}",
						pendingReader,
						pendingAssignmentForReader);
				enumeratorContext.assignSplit(
						pendingReader, new ArrayList<>(pendingAssignmentForReader));
				enumeratorContext.signalNoMoreSplits(pendingReader);
			}
		}
	}
}

Here several methods of enumeratorContext are called to assign the concrete splits to the different tasks.

@Override
public Set<Integer> registeredReaders() {
	return running ? Collections.singleton(subtaskId) : Collections.emptySet();
}

registeredReaders returns the subtaskId of the current task, which is then used to look up the corresponding set of splits.

Then the assignSplit method of enumeratorContext is called to hand the splits over to the parallelSource:

@Override
public void assignSplit(int subtaskId, List<SplitT> splits) {
	if (this.subtaskId == subtaskId) {
		parallelSource.addSplits(splits);
	}
}

Eventually the splits are passed to the concrete reader, in this case FakeSourceReader:

@Override
public void addSplits(List<FakeSourceSplit> splits) {
	log.debug("reader {} add splits {}", context.getIndexOfSubtask(), splits);
	this.splits.addAll(splits);
}

At this point the split-assignment logic in the run method of ParallelSource is done, and the splits have been handed to the reader. The next step is calling reader.pollNext(collector). Below is the FakeSourceReader implementation:

@SuppressWarnings("MagicNumber")
public void pollNext(Collector<SeaTunnelRow> output) throws InterruptedException {
	long currentTimestamp = Instant.now().toEpochMilli();
	if (currentTimestamp <= latestTimestamp + config.getSplitReadInterval()) {
		return;
	}
	latestTimestamp = currentTimestamp;
	synchronized (output.getCheckpointLock()) {
		FakeSourceSplit split = splits.poll();
		if (null != split) {
			// Randomly generated data are sent directly to the downstream operator
			fakeDataGenerator.collectFakedRows(split.getRowNum(), output);
			log.info(
					"{} rows of data have been generated in split({}). Generation time: {}",
					split.getRowNum(),
					split.splitId(),
					latestTimestamp);
		} else {
			if (!noMoreSplit) {
				log.info("wait split!");
			}
		}
	}
	if (noMoreSplit
			&& splits.isEmpty()
			&& Boundedness.BOUNDED.equals(context.getBoundedness())) {
		// signal to the source that we have reached the end of the data.
		log.info("Closed the bounded fake source");
		context.signalNoMoreElement();
	}
	Thread.sleep(1000L);
}

As you can see, FakeSourceReader takes a split from its queue and only then actually generates data according to that split. Different readers implement this differently, and it is worth reading through a few more reader implementations.
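To tie the whole source-side flow together, here is a framework-free toy (no SeaTunnel or Flink classes; everything below is made up for illustration) that mimics the enumerator/reader collaboration walked through above: splits are assigned to readers by splitId % parallelism, and each reader then polls only its own pending splits, the same shape as the ParallelSource run loop.

import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class EnumeratorReaderToy {
    public static void main(String[] args) {
        int parallelism = 2;
        List<Integer> splitIds = List.of(0, 1, 2, 3);

        // enumerator side: pending splits per reader,
        // mirroring addSplitChangeToPendingAssignments (ownerReader = splitId % parallelism)
        Map<Integer, Queue<Integer>> pendingSplits = new HashMap<>();
        for (int splitId : splitIds) {
            pendingSplits
                    .computeIfAbsent(splitId % parallelism, r -> new ArrayDeque<>())
                    .add(splitId);
        }

        // reader side: each subtask polls only its own splits and "emits rows" for them,
        // mirroring assignSplit + reader.pollNext in the walkthrough above
        for (int subtaskId = 0; subtaskId < parallelism; subtaskId++) {
            Queue<Integer> mySplits = pendingSplits.getOrDefault(subtaskId, new ArrayDeque<>());
            Integer splitId;
            while ((splitId = mySplits.poll()) != null) {
                System.out.printf("reader %d processes split %d%n", subtaskId, splitId);
            }
        }
    }
}

Keeping this picture in mind makes the real classes much easier to navigate.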

Summary

Having walked through the code above, the overall approach becomes much clearer if you ever need to implement a Source yourself. The transform and sink sides are left for you to explore on your own.