How Spark Structured Streaming and Flink implement checkpoints differently (Part 2)

Introduction

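This series comes in three parts: Part 1 covers the theory, Part 2 the Flink implementation code, and Part 3 the Spark implementation code. It explains how checkpointing is implemented in Spark and in Flink, and roughly why it is implemented that way. Only real-time (streaming) systems are discussed; nothing else is covered. This is Part 2.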

Flink

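Part 1 explained that for a checkpoint to represent the state of the whole system, Flink has to make sure that no events are in flight on the edges of the graph. Internally, Flink uses a CheckpointBarrier as the special marker for this. When a task receives such a marker on one input channel, it buffers the input that follows on that channel until all of its input channels (partitions) have received the barrier with the same ID.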
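To make the alignment rule above concrete before diving into Flink itself, here is a minimal sketch of it. It is not Flink code: the class BarrierAlignmentSketch and its methods are made up for illustration, records are plain strings, and it assumes only one checkpoint is in progress at a time.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

// Hypothetical illustration class; not part of Flink.
class BarrierAlignmentSketch {
	private final boolean[] blocked;                            // true once a channel delivered the barrier
	private final Queue<String> buffered = new ArrayDeque<>();  // records held back during alignment
	private final List<String> output = new ArrayList<>();      // what gets forwarded downstream
	private int barriersReceived = 0;

	BarrierAlignmentSketch(int numChannels) {
		this.blocked = new boolean[numChannels];
	}

	// A data record arrived on the given channel.
	void onRecord(int channel, String record) {
		if (blocked[channel]) {
			buffered.add(record);   // barrier already seen on this channel: hold the record back
		} else {
			output.add(record);     // no barrier yet on this channel: forward immediately
		}
	}

	// A barrier arrived on the given channel (assumption: a single checkpoint in flight).
	void onBarrier(int channel, long checkpointId) {
		if (blocked[channel]) {
			return;                 // repeated barrier on the same channel is ignored in this sketch
		}
		blocked[channel] = true;
		barriersReceived++;
		if (barriersReceived == blocked.length) {
			// all channels aligned: emit the barrier, then replay what was held back
			output.add("BARRIER-" + checkpointId);
			while (!buffered.isEmpty()) {
				output.add(buffered.poll());
			}
			Arrays.fill(blocked, false);
			barriersReceived = 0;
		}
	}

	List<String> getOutput() {
		return output;
	}

	public static void main(String[] args) {
		BarrierAlignmentSketch task = new BarrierAlignmentSketch(2);
		task.onBarrier(0, 7L);
		task.onRecord(0, "held");
		task.onRecord(1, "passes");
		task.onBarrier(1, 7L);
		System.out.println(task.getOutput()); // [passes, BARRIER-7, held]
	}
}
```

The record that arrives behind the barrier on channel 0 only shows up downstream after the barrier itself, which is the ordering a consistent snapshot needs.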

Note

All Flink code below is based on Flink 1.10.

Starting point

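Faced with this much code, it is hard to know where to begin. We can start from one simple class, find a few simple unit tests, and step through them in the debugger to see what the flow looks like.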

Getting started

CheckpointBarrier

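We are not starting from zero either: we roughly know there is a class with Barrier in its name. With IntelliJ it is easy to find the class org.apache.flink.runtime.io.network.api.CheckpointBarrier and its constructor CheckpointBarrier#CheckpointBarrier. From there, press Command+B to see who calls it.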

One of the callers is a unit test, org.apache.flink.streaming.runtime.tasks.OneInputStreamTaskTest#testCheckpointBarriers:

	/**
	 * This test verifies that checkpoint barriers are correctly forwarded.
	 */
	@Test
	public void testCheckpointBarriers() throws Exception {
		final OneInputStreamTaskTestHarness<String, String> testHarness = new OneInputStreamTaskTestHarness<>(
				OneInputStreamTask::new,
				2, 2,
				BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO);

		testHarness.setupOutputForSingletonOperatorChain();

		StreamConfig streamConfig = testHarness.getStreamConfig();
		StreamMap<String, String> mapOperator = new StreamMap<>(new IdentityMap());
		streamConfig.setStreamOperator(mapOperator);
		streamConfig.setOperatorID(new OperatorID());

		ConcurrentLinkedQueue<Object> expectedOutput = new ConcurrentLinkedQueue<>();
		long initialTime = 0L;

		testHarness.invoke();
		testHarness.waitForTaskRunning();

		testHarness.processEvent(new CheckpointBarrier(0, 0, CheckpointOptions.forCheckpointWithDefaultLocation()), 0, 0);

		// These elements should be buffered until we receive barriers from
		// all inputs
		testHarness.processElement(new StreamRecord<>("Hello-0-0", initialTime), 0, 0);
		testHarness.processElement(new StreamRecord<>("Ciao-0-0", initialTime), 0, 0);

		// These elements should be forwarded, since we did not yet receive a checkpoint barrier
		// on that input, only add to same input, otherwise we would not know the ordering
		// of the output since the Task might read the inputs in any order
		testHarness.processElement(new StreamRecord<>("Hello-1-1", initialTime), 1, 1);
		testHarness.processElement(new StreamRecord<>("Ciao-1-1", initialTime), 1, 1);
		expectedOutput.add(new StreamRecord<>("Hello-1-1", initialTime));
		expectedOutput.add(new StreamRecord<>("Ciao-1-1", initialTime));

		testHarness.waitForInputProcessing();
		// we should not yet see the barrier, only the two elements from non-blocked input
		TestHarnessUtil.assertOutputEquals("Output was not correct.", expectedOutput, testHarness.getOutput());

		testHarness.processEvent(new CheckpointBarrier(0, 0, CheckpointOptions.forCheckpointWithDefaultLocation()), 0, 1);
		testHarness.processEvent(new CheckpointBarrier(0, 0, CheckpointOptions.forCheckpointWithDefaultLocation()), 1, 0);
		testHarness.processEvent(new CheckpointBarrier(0, 0, CheckpointOptions.forCheckpointWithDefaultLocation()), 1, 1);

		testHarness.waitForInputProcessing();

		// now we should see the barrier and after that the buffered elements
		expectedOutput.add(new CheckpointBarrier(0, 0, CheckpointOptions.forCheckpointWithDefaultLocation()));
		expectedOutput.add(new StreamRecord<>("Hello-0-0", initialTime));
		expectedOutput.add(new StreamRecord<>("Ciao-0-0", initialTime));

		testHarness.endInput();

		testHarness.waitForTaskCompletion();

		TestHarnessUtil.assertOutputEquals("Output was not correct.", expectedOutput, testHarness.getOutput());
	}

Let's step into the test harness's invoke method:

	/**
	 * Invoke the Task. This resets the output of any previous invocation. This will start a new
	 * Thread to execute the Task in. Use {@link #waitForTaskCompletion()} to wait for the
	 * Task thread to finish running.
	 *
	 */
	public Thread invoke(StreamMockEnvironment mockEnv) throws Exception {
		checkState(this.mockEnv == null);
		checkState(this.taskThread == null);
		this.mockEnv = checkNotNull(mockEnv);

		initializeInputs();
		initializeOutput();

		taskThread = new TaskThread(() -> taskFactory.apply(mockEnv));
		taskThread.start();
		// Wait until the task is set
		while (taskThread.task == null) {
			Thread.sleep(10L);
		}

		return taskThread;
	}

In TaskThread's run method we find the call to task.invoke, which eventually leads to org.apache.flink.streaming.runtime.tasks.StreamTask#invoke:

	@Override
	public final void invoke() throws Exception {
		try {
			beforeInvoke();

			// final check to exit early before starting to run
			if (canceled) {
				throw new CancelTaskException();
			}

			// let the task do its work
			runMailboxLoop();

			// if this left the run() method cleanly despite the fact that this was canceled,
			// make sure the "clean shutdown" is not attempted
			if (canceled) {
				throw new CancelTaskException();
			}

			afterInvoke();
		}
		finally {
			cleanUpInvoke();
		}
	}

Following runMailboxLoop all the way down, we reach org.apache.flink.streaming.runtime.tasks.mailbox.MailboxDefaultAction#runDefaultAction. The default action comes from this.mailboxProcessor = new MailboxProcessor(this::processInput, mailbox, actionExecutor); in other words, it is the processInput method. Following that method further down, we finally arrive at the protagonist, org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput#emitNext:

	@Override
	public InputStatus emitNext(DataOutput<T> output) throws Exception {

		while (true) {
			// get the stream element from the deserializer
			if (currentRecordDeserializer != null) {
				DeserializationResult result = currentRecordDeserializer.getNextRecord(deserializationDelegate);
				if (result.isBufferConsumed()) {
					currentRecordDeserializer.getCurrentBuffer().recycleBuffer();
					currentRecordDeserializer = null;
				}

				if (result.isFullRecord()) {
					processElement(deserializationDelegate.getInstance(), output);
					return InputStatus.MORE_AVAILABLE;
				}
			}

			Optional<BufferOrEvent> bufferOrEvent = checkpointedInputGate.pollNext();
			if (bufferOrEvent.isPresent()) {
				processBufferOrEvent(bufferOrEvent.get());
			} else {
				if (checkpointedInputGate.isFinished()) {
					checkState(checkpointedInputGate.getAvailableFuture().isDone(), "Finished BarrierHandler should be available");
					if (!checkpointedInputGate.isEmpty()) {
						throw new IllegalStateException("Trailing data in checkpoint barrier handler.");
					}
					return InputStatus.END_OF_INPUT;
				}
				return InputStatus.NOTHING_AVAILABLE;
			}
		}
	}
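Stepping back for a moment, the chain from runMailboxLoop to a default action that returns a status can be pictured with a small sketch. This is only my simplified, single-threaded picture and not Flink's MailboxProcessor: MailboxLoopSketch, send and the two-value Status enum are invented names, and the NOTHING_AVAILABLE case (where the real loop would wait for input) is left out on purpose.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Queue;
import java.util.function.Supplier;

// Hypothetical illustration class; not part of Flink.
class MailboxLoopSketch {
	// Simplified stand-in for an input status; the waiting case is omitted.
	enum Status { MORE_AVAILABLE, END_OF_INPUT }

	private final Queue<Runnable> mailbox = new ArrayDeque<>();
	private final Supplier<Status> defaultAction;

	MailboxLoopSketch(Supplier<Status> defaultAction) {
		this.defaultAction = defaultAction;
	}

	// Enqueue a control action (for example "trigger a checkpoint") to run on the task thread.
	void send(Runnable mail) {
		mailbox.add(mail);
	}

	// Drain pending mail first, then run the default action once; repeat until input ends.
	void run() {
		while (true) {
			Runnable mail;
			while ((mail = mailbox.poll()) != null) {
				mail.run();
			}
			if (defaultAction.get() == Status.END_OF_INPUT) {
				return;
			}
		}
	}

	public static void main(String[] args) {
		List<String> log = new ArrayList<>();
		Iterator<String> input = Arrays.asList("a", "b").iterator();
		// The default action plays the role of processInput: handle one record per call.
		MailboxLoopSketch loop = new MailboxLoopSketch(() -> {
			if (!input.hasNext()) {
				return Status.END_OF_INPUT;
			}
			log.add("record:" + input.next());
			return Status.MORE_AVAILABLE;
		});
		loop.send(() -> log.add("mail"));
		loop.run();
		System.out.println(log); // [mail, record:a, record:b]
	}
}
```

Handling one record per call and then returning is what lets queued control actions run between records on the same thread.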

Following the checkpointedInputGate.pollNext call in emitNext, we arrive at org.apache.flink.streaming.runtime.io.CheckpointedInputGate#pollNext.

This is finally where the Flink barrier behavior described earlier actually lives:

	@Override
	public Optional<BufferOrEvent> pollNext() throws Exception {
		while (true) {
			// process buffered BufferOrEvents before grabbing new ones
			Optional<BufferOrEvent> next;
			if (bufferStorage.isEmpty()) {
				next = inputGate.pollNext();
			}
			else {
				next = bufferStorage.pollNext();
				if (!next.isPresent()) {
					return pollNext();
				}
			}

			if (!next.isPresent()) {
				return handleEmptyBuffer();
			}

			BufferOrEvent bufferOrEvent = next.get();
			if (barrierHandler.isBlocked(offsetChannelIndex(bufferOrEvent.getChannelIndex()))) {
				// if the channel is blocked, we just store the BufferOrEvent
				bufferStorage.add(bufferOrEvent);
				if (bufferStorage.isFull()) {
					barrierHandler.checkpointSizeLimitExceeded(bufferStorage.getMaxBufferedBytes());
					bufferStorage.rollOver();
				}
			}
			else if (bufferOrEvent.isBuffer()) {
				return next;
			}
			else if (bufferOrEvent.getEvent().getClass() == CheckpointBarrier.class) {
				CheckpointBarrier checkpointBarrier = (CheckpointBarrier) bufferOrEvent.getEvent();
				if (!endOfInputGate) {
					// process barriers only if there is a chance of the checkpoint completing
					if (barrierHandler.processBarrier(checkpointBarrier, offsetChannelIndex(bufferOrEvent.getChannelIndex()), bufferStorage.getPendingBytes())) {
						bufferStorage.rollOver();
					}
				}
			}
			else if (bufferOrEvent.getEvent().getClass() == CancelCheckpointMarker.class) {
				if (barrierHandler.processCancellationBarrier((CancelCheckpointMarker) bufferOrEvent.getEvent())) {
					bufferStorage.rollOver();
				}
			}
			else {
				if (bufferOrEvent.getEvent().getClass() == EndOfPartitionEvent.class) {
					if (barrierHandler.processEndOfPartition()) {
						bufferStorage.rollOver();
					}
				}
				return next;
			}
		}
	}

In CheckpointedInputGate#pollNext we can see that when isBlocked returns true, the event goes into a bufferStorage, and that when a barrier event arrives, a handler takes care of it through processBarrier. Let's look at one implementation of that handler, org.apache.flink.streaming.runtime.io.CheckpointBarrierAligner:

	@Override
	public boolean processBarrier(CheckpointBarrier receivedBarrier, int channelIndex, long bufferedBytes) throws Exception {
		final long barrierId = receivedBarrier.getId();

		// fast path for single channel cases
		if (totalNumberOfInputChannels == 1) {
			if (barrierId > currentCheckpointId) {
				// new checkpoint
				currentCheckpointId = barrierId;
				notifyCheckpoint(receivedBarrier, bufferedBytes, latestAlignmentDurationNanos);
			}
			return false;
		}

		boolean checkpointAborted = false;

		// -- general code path for multiple input channels --

		if (numBarriersReceived > 0) {
			// this is only true if some alignment is already progress and was not canceled

			if (barrierId == currentCheckpointId) {
				// regular case
				onBarrier(channelIndex);
			}
			else if (barrierId > currentCheckpointId) {
				// we did not complete the current checkpoint, another started before
				LOG.warn("{}: Received checkpoint barrier for checkpoint {} before completing current checkpoint {}. " +
						"Skipping current checkpoint.",
					taskName,
					barrierId,
					currentCheckpointId);

				// let the task know we are not completing this
				notifyAbort(currentCheckpointId,
					new CheckpointException(
						"Barrier id: " + barrierId,
						CheckpointFailureReason.CHECKPOINT_DECLINED_SUBSUMED));

				// abort the current checkpoint
				releaseBlocksAndResetBarriers();
				checkpointAborted = true;

				// begin a new checkpoint
				beginNewAlignment(barrierId, channelIndex, receivedBarrier.getTimestamp());
			}
			else {
				// ignore trailing barrier from an earlier checkpoint (obsolete now)
				return false;
			}
		}
		else if (barrierId > currentCheckpointId) {
			// first barrier of a new checkpoint
			beginNewAlignment(barrierId, channelIndex, receivedBarrier.getTimestamp());
		}
		else {
			// either the current checkpoint was canceled (numBarriers == 0) or
			// this barrier is from an old subsumed checkpoint
			return false;
		}

		// check if we have all barriers - since canceled checkpoints always have zero barriers
		// this can only happen on a non canceled checkpoint
		if (numBarriersReceived + numClosedChannels == totalNumberOfInputChannels) {
			// actually trigger checkpoint
			if (LOG.isDebugEnabled()) {
				LOG.debug("{}: Received all barriers, triggering checkpoint {} at {}.",
					taskName,
					receivedBarrier.getId(),
					receivedBarrier.getTimestamp());
			}

			releaseBlocksAndResetBarriers();
			notifyCheckpoint(receivedBarrier, bufferedBytes, latestAlignmentDurationNanos);
			return true;
		}
		return checkpointAborted;
	}

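Since this is mostly if-else branches, I will only comment on it briefly. Taking the multi-input case as the example, the deciding condition is: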

		// check if we have all barriers - since canceled checkpoints always have zero barriers
		// this can only happen on a non canceled checkpoint
		if (numBarriersReceived + numClosedChannels == totalNumberOfInputChannels) {
			// actually trigger checkpoint
			if (LOG.isDebugEnabled()) {
				LOG.debug("{}: Received all barriers, triggering checkpoint {} at {}.",
					taskName,
					receivedBarrier.getId(),
					receivedBarrier.getTimestamp());
			}

			releaseBlocksAndResetBarriers();
			notifyCheckpoint(receivedBarrier, bufferedBytes, latestAlignmentDurationNanos);
			return true;

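That is, once the barriers from all channels have arrived, the checkpoint is triggered and the blocked flags set earlier are cleared. Most of the other if-else branches handle abnormal cases, for example receiving the barrier of a newer checkpoint before the current checkpoint has finished, and so on. The code is certainly clearer than my description.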
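For readers who prefer to see the branches in isolation, here is a stripped-down sketch of the same decision logic. It follows the shape of the processBarrier method quoted above but is not the Flink class: AlignerDecisionSketch and its events log are invented for illustration, and closed channels, the single-channel fast path, metrics and exceptions are all left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; a simplified stand-in, not Flink's CheckpointBarrierAligner.
class AlignerDecisionSketch {
	private final boolean[] blocked;
	private long currentCheckpointId = -1L;
	private int numBarriersReceived = 0;
	final List<String> events = new ArrayList<>();   // records "abort N" / "trigger N" decisions

	AlignerDecisionSketch(int totalChannels) {
		this.blocked = new boolean[totalChannels];
	}

	void processBarrier(long barrierId, int channel) {
		if (numBarriersReceived > 0) {
			if (barrierId == currentCheckpointId) {
				onBarrier(channel);                          // regular case: one more barrier
			} else if (barrierId > currentCheckpointId) {
				events.add("abort " + currentCheckpointId);  // a newer checkpoint subsumes this one
				releaseBlocksAndResetBarriers();
				beginNewAlignment(barrierId, channel);
			} else {
				return;                                      // trailing barrier of an older checkpoint
			}
		} else if (barrierId > currentCheckpointId) {
			beginNewAlignment(barrierId, channel);           // first barrier of a new checkpoint
		} else {
			return;                                          // stale or already handled checkpoint
		}

		if (numBarriersReceived == blocked.length) {
			releaseBlocksAndResetBarriers();                 // all barriers arrived: unblock everything
			events.add("trigger " + barrierId);
		}
	}

	boolean isBlocked(int channel) {
		return blocked[channel];
	}

	private void beginNewAlignment(long checkpointId, int channel) {
		currentCheckpointId = checkpointId;
		onBarrier(channel);
	}

	private void onBarrier(int channel) {
		if (!blocked[channel]) {                             // repeated barriers are ignored in this sketch
			blocked[channel] = true;
			numBarriersReceived++;
		}
	}

	private void releaseBlocksAndResetBarriers() {
		Arrays.fill(blocked, false);
		numBarriersReceived = 0;
	}

	public static void main(String[] args) {
		AlignerDecisionSketch aligner = new AlignerDecisionSketch(2);
		aligner.processBarrier(1L, 0);   // alignment for checkpoint 1 starts
		aligner.processBarrier(2L, 1);   // checkpoint 2 arrives first: checkpoint 1 is aborted
		aligner.processBarrier(1L, 0);   // late barrier of checkpoint 1: ignored
		aligner.processBarrier(2L, 0);   // checkpoint 2 is now complete
		System.out.println(aligner.events); // [abort 1, trigger 2]
	}
}
```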

Afterword

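Words are paler than code. Take the description in the official Flink documentation and read it side by side with this code to see how Flink aligns barriers and achieves fault tolerance. If you are curious enough to explore the code yourself, you can use this article as a companion: the call chain is fairly long, so I have only excerpted the parts that matter. I still recommend tracing the code yourself if you can. When you lose the trail, jump straight to one of the class names given in this article and keep going from there.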

Closing

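I am not much of a writer, and I cannot produce the storytelling style you see in many viral articles. I write mainly to sort out my own thinking, and also to practice the open-source spirit a little; after all, sharing knowledge is part of that spirit too. If something in this article is badly written or could be improved, feel free to leave a comment. I don't expect many people to read my articles, though... If you really did read this far, I hope it helped you a bit.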

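WeChat official account

WeChat official account: 进击的大数据

It focuses on big data technology. For questions or suggestions, please leave a message on the official account. There are a few more articles over there; if you want to read them I won't stop you, haha.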