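Opening
This series has three parts: part one covers the theory, part two walks through the Flink implementation, and part three walks through the Spark implementation. It explains how checkpointing is implemented in Spark and Flink and roughly why it is implemented that way. Only real-time (streaming) systems are discussed; everything else is out of scope. This is part two.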
Flink
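The previous part explained that for a checkpoint to capture the state of the whole system, Flink must guarantee that no events remain in flight on the edges of the graph. Internally, Flink uses CheckpointBarrier as the special marker for this. When a task receives a barrier on one input channel, it buffers whatever arrives on that channel afterwards, until every input channel has delivered a barrier with the same ID.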
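To make the alignment rule concrete before diving into the real code, here is a minimal sketch of it. It is a toy model, not Flink code: the class AlignmentSketch and its methods are hypothetical names invented for illustration, and it assumes that only one checkpoint is in progress at a time.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

/** Toy model of barrier alignment. Hypothetical names; this is not Flink code. */
final class AlignmentSketch {
    private final boolean[] blocked;                            // true once a channel has delivered the barrier
    private final Queue<String> buffered = new ArrayDeque<>();  // records held back from blocked channels
    private final List<String> output = new ArrayList<>();      // what the task emits downstream
    private int barriersSeen;

    AlignmentSketch(int numChannels) {
        this.blocked = new boolean[numChannels];
    }

    /** A record arrives: buffer it if its channel is blocked, otherwise forward it. */
    void onRecord(int channel, String record) {
        if (blocked[channel]) {
            buffered.add(record);
        } else {
            output.add(record);
        }
    }

    /** A barrier arrives: block the channel; once every channel is blocked, release everything. */
    void onBarrier(int channel, long checkpointId) {
        if (blocked[channel]) {
            throw new IllegalStateException("Channel " + channel + " already delivered a barrier");
        }
        blocked[channel] = true;
        barriersSeen++;
        if (barriersSeen == blocked.length) {
            output.add("barrier-" + checkpointId);  // forward the barrier first
            output.addAll(buffered);                // then the buffered records, in arrival order
            buffered.clear();
            Arrays.fill(blocked, false);
            barriersSeen = 0;
        }
    }

    List<String> output() {
        return output;
    }

    public static void main(String[] args) {
        AlignmentSketch task = new AlignmentSketch(2);
        task.onBarrier(0, 7L);
        task.onRecord(0, "a");   // channel 0 is blocked, so "a" is buffered
        task.onRecord(1, "b");   // channel 1 is not blocked, so "b" is forwarded
        task.onBarrier(1, 7L);   // all barriers have arrived
        System.out.println(task.output()); // prints [b, barrier-7, a]
    }
}
```

The sketch deliberately ignores mismatched checkpoint IDs, closed channels, and buffer size limits, all of which the real code has to handle.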
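Note
All Flink code below is based on Flink 1.10.
Starting point
Faced with this much code, it is hard to know where to begin. A practical approach is to start from one simple class, find a simple unit test that uses it, and step through that test in a debugger to see how the flow works.
Getting started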
CheckpointBarrier
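We are not starting from zero: we roughly know there are classes related to barriers. With IntelliJ it is easy to find org.apache.flink.runtime.io.network.api.CheckpointBarrier#CheckpointBarrier, the constructor of the CheckpointBarrier class. From there, press Command+B to see who calls it.

One of the callers is a unit test, org.apache.flink.streaming.runtime.tasks.OneInputStreamTaskTest#testCheckpointBarriers: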
/**
 * This test verifies that checkpoint barriers are correctly forwarded.
 */
@Test
public void testCheckpointBarriers() throws Exception {
    final OneInputStreamTaskTestHarness<String, String> testHarness = new OneInputStreamTaskTestHarness<>(
            OneInputStreamTask::new,
            2, 2,
            BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO);
    testHarness.setupOutputForSingletonOperatorChain();

    StreamConfig streamConfig = testHarness.getStreamConfig();
    StreamMap<String, String> mapOperator = new StreamMap<>(new IdentityMap());
    streamConfig.setStreamOperator(mapOperator);
    streamConfig.setOperatorID(new OperatorID());

    ConcurrentLinkedQueue<Object> expectedOutput = new ConcurrentLinkedQueue<>();
    long initialTime = 0L;

    testHarness.invoke();
    testHarness.waitForTaskRunning();

    testHarness.processEvent(new CheckpointBarrier(0, 0, CheckpointOptions.forCheckpointWithDefaultLocation()), 0, 0);

    // These elements should be buffered until we receive barriers from
    // all inputs
    testHarness.processElement(new StreamRecord<>("Hello-0-0", initialTime), 0, 0);
    testHarness.processElement(new StreamRecord<>("Ciao-0-0", initialTime), 0, 0);

    // These elements should be forwarded, since we did not yet receive a checkpoint barrier
    // on that input, only add to same input, otherwise we would not know the ordering
    // of the output since the Task might read the inputs in any order
    testHarness.processElement(new StreamRecord<>("Hello-1-1", initialTime), 1, 1);
    testHarness.processElement(new StreamRecord<>("Ciao-1-1", initialTime), 1, 1);
    expectedOutput.add(new StreamRecord<>("Hello-1-1", initialTime));
    expectedOutput.add(new StreamRecord<>("Ciao-1-1", initialTime));

    testHarness.waitForInputProcessing();
    // we should not yet see the barrier, only the two elements from non-blocked input
    TestHarnessUtil.assertOutputEquals("Output was not correct.", expectedOutput, testHarness.getOutput());

    testHarness.processEvent(new CheckpointBarrier(0, 0, CheckpointOptions.forCheckpointWithDefaultLocation()), 0, 1);
    testHarness.processEvent(new CheckpointBarrier(0, 0, CheckpointOptions.forCheckpointWithDefaultLocation()), 1, 0);
    testHarness.processEvent(new CheckpointBarrier(0, 0, CheckpointOptions.forCheckpointWithDefaultLocation()), 1, 1);

    testHarness.waitForInputProcessing();

    // now we should see the barrier and after that the buffered elements
    expectedOutput.add(new CheckpointBarrier(0, 0, CheckpointOptions.forCheckpointWithDefaultLocation()));
    expectedOutput.add(new StreamRecord<>("Hello-0-0", initialTime));
    expectedOutput.add(new StreamRecord<>("Ciao-0-0", initialTime));

    testHarness.endInput();
    testHarness.waitForTaskCompletion();

    TestHarnessUtil.assertOutputEquals("Output was not correct.", expectedOutput, testHarness.getOutput());
}
Let's step into the test harness's invoke method:
/**
 * Invoke the Task. This resets the output of any previous invocation. This will start a new
 * Thread to execute the Task in. Use {@link #waitForTaskCompletion()} to wait for the
 * Task thread to finish running.
 *
 */
public Thread invoke(StreamMockEnvironment mockEnv) throws Exception {
    checkState(this.mockEnv == null);
    checkState(this.taskThread == null);
    this.mockEnv = checkNotNull(mockEnv);

    initializeInputs();
    initializeOutput();

    taskThread = new TaskThread(() -> taskFactory.apply(mockEnv));
    taskThread.start();

    // Wait until the task is set
    while (taskThread.task == null) {
        Thread.sleep(10L);
    }

    return taskThread;
}
Inside the run method of TaskThread we find the call to task.invoke, which eventually leads to org.apache.flink.streaming.runtime.tasks.StreamTask#invoke:
@Override
public final void invoke() throws Exception {
    try {
        beforeInvoke();

        // final check to exit early before starting to run
        if (canceled) {
            throw new CancelTaskException();
        }

        // let the task do its work
        runMailboxLoop();

        // if this left the run() method cleanly despite the fact that this was canceled,
        // make sure the "clean shutdown" is not attempted
        if (canceled) {
            throw new CancelTaskException();
        }

        afterInvoke();
    }
    finally {
        cleanUpInvoke();
    }
}
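Following runMailboxLoop all the way in, we reach org.apache.flink.streaming.runtime.tasks.mailbox.MailboxDefaultAction#runDefaultAction. The default action is the first argument passed when the processor is constructed, this.mailboxProcessor = new MailboxProcessor(this::processInput, mailbox, actionExecutor); in other words, it is the processInput method. Following that method further in, we finally arrive at the main character, org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput#emitNext: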
@Override
public InputStatus emitNext(DataOutput<T> output) throws Exception {
    while (true) {
        // get the stream element from the deserializer
        if (currentRecordDeserializer != null) {
            DeserializationResult result = currentRecordDeserializer.getNextRecord(deserializationDelegate);
            if (result.isBufferConsumed()) {
                currentRecordDeserializer.getCurrentBuffer().recycleBuffer();
                currentRecordDeserializer = null;
            }

            if (result.isFullRecord()) {
                processElement(deserializationDelegate.getInstance(), output);
                return InputStatus.MORE_AVAILABLE;
            }
        }

        Optional<BufferOrEvent> bufferOrEvent = checkpointedInputGate.pollNext();
        if (bufferOrEvent.isPresent()) {
            processBufferOrEvent(bufferOrEvent.get());
        } else {
            if (checkpointedInputGate.isFinished()) {
                checkState(checkpointedInputGate.getAvailableFuture().isDone(), "Finished BarrierHandler should be available");
                if (!checkpointedInputGate.isEmpty()) {
                    throw new IllegalStateException("Trailing data in checkpoint barrier handler.");
                }
                return InputStatus.END_OF_INPUT;
            }
            return InputStatus.NOTHING_AVAILABLE;
        }
    }
}
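The mailbox loop that led us to emitNext can be pictured with a small sketch: a loop that first drains any queued "mail" (control actions) and then runs one step of a default action, which in StreamTask is processInput. This is a single-threaded toy under my own assumptions, not Flink's MailboxProcessor; MailboxSketch, DefaultAction, send, and runLoop are hypothetical names, and the real mailbox also has to deal with mail sent from other threads.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Single-threaded toy mailbox loop. Hypothetical names; this is not Flink's MailboxProcessor. */
final class MailboxSketch {
    /** One step of regular work; returns false when there is nothing left to do. */
    interface DefaultAction {
        boolean runOnce();
    }

    private final Queue<Runnable> mailbox = new ArrayDeque<>();
    private final DefaultAction defaultAction;

    MailboxSketch(DefaultAction defaultAction) {
        this.defaultAction = defaultAction;
    }

    /** Enqueue a control action to run before the next default-action step. */
    void send(Runnable mail) {
        mailbox.add(mail);
    }

    /** Drain the mailbox, run one default-action step, and repeat until the action reports no more work. */
    void runLoop() {
        boolean moreWork = true;
        while (moreWork) {
            Runnable mail;
            while ((mail = mailbox.poll()) != null) {
                mail.run();
            }
            moreWork = defaultAction.runOnce();
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        int[] processed = {0};
        // The default action stands in for processInput: it handles one input per step.
        MailboxSketch loop = new MailboxSketch(() -> {
            log.add("input-" + processed[0]);
            return ++processed[0] < 3;
        });
        loop.send(() -> log.add("mail"));
        loop.runLoop();
        System.out.println(log); // prints [mail, input-0, input-1, input-2]
    }
}
```

The point of the design is that regular record processing and control actions run on the same thread in an interleaved fashion, so they do not need to lock against each other.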
Following pollNext, we arrive at org.apache.flink.streaming.runtime.io.CheckpointedInputGate#pollNext.
- This is finally where the concrete behavior of the Flink barrier lives.
@Override
public Optional<BufferOrEvent> pollNext() throws Exception {
    while (true) {
        // process buffered BufferOrEvents before grabbing new ones
        Optional<BufferOrEvent> next;
        if (bufferStorage.isEmpty()) {
            next = inputGate.pollNext();
        }
        else {
            next = bufferStorage.pollNext();
            if (!next.isPresent()) {
                return pollNext();
            }
        }

        if (!next.isPresent()) {
            return handleEmptyBuffer();
        }

        BufferOrEvent bufferOrEvent = next.get();
        if (barrierHandler.isBlocked(offsetChannelIndex(bufferOrEvent.getChannelIndex()))) {
            // if the channel is blocked, we just store the BufferOrEvent
            bufferStorage.add(bufferOrEvent);
            if (bufferStorage.isFull()) {
                barrierHandler.checkpointSizeLimitExceeded(bufferStorage.getMaxBufferedBytes());
                bufferStorage.rollOver();
            }
        }
        else if (bufferOrEvent.isBuffer()) {
            return next;
        }
        else if (bufferOrEvent.getEvent().getClass() == CheckpointBarrier.class) {
            CheckpointBarrier checkpointBarrier = (CheckpointBarrier) bufferOrEvent.getEvent();
            if (!endOfInputGate) {
                // process barriers only if there is a chance of the checkpoint completing
                if (barrierHandler.processBarrier(checkpointBarrier, offsetChannelIndex(bufferOrEvent.getChannelIndex()), bufferStorage.getPendingBytes())) {
                    bufferStorage.rollOver();
                }
            }
        }
        else if (bufferOrEvent.getEvent().getClass() == CancelCheckpointMarker.class) {
            if (barrierHandler.processCancellationBarrier((CancelCheckpointMarker) bufferOrEvent.getEvent())) {
                bufferStorage.rollOver();
            }
        }
        else {
            if (bufferOrEvent.getEvent().getClass() == EndOfPartitionEvent.class) {
                if (barrierHandler.processEndOfPartition()) {
                    bufferStorage.rollOver();
                }
            }
            return next;
        }
    }
}
Here we can see that when isBlocked returns true, the event goes into bufferStorage, and that when a barrier event arrives, a handler is asked to processBarrier. Let's look at one implementation of that handler, org.apache.flink.streaming.runtime.io.CheckpointBarrierAligner:
@Override
public boolean processBarrier(CheckpointBarrier receivedBarrier, int channelIndex, long bufferedBytes) throws Exception {
    final long barrierId = receivedBarrier.getId();

    // fast path for single channel cases
    if (totalNumberOfInputChannels == 1) {
        if (barrierId > currentCheckpointId) {
            // new checkpoint
            currentCheckpointId = barrierId;
            notifyCheckpoint(receivedBarrier, bufferedBytes, latestAlignmentDurationNanos);
        }
        return false;
    }

    boolean checkpointAborted = false;

    // -- general code path for multiple input channels --

    if (numBarriersReceived > 0) {
        // this is only true if some alignment is already progress and was not canceled

        if (barrierId == currentCheckpointId) {
            // regular case
            onBarrier(channelIndex);
        }
        else if (barrierId > currentCheckpointId) {
            // we did not complete the current checkpoint, another started before
            LOG.warn("{}: Received checkpoint barrier for checkpoint {} before completing current checkpoint {}. " +
                    "Skipping current checkpoint.",
                taskName,
                barrierId,
                currentCheckpointId);

            // let the task know we are not completing this
            notifyAbort(currentCheckpointId,
                new CheckpointException(
                    "Barrier id: " + barrierId,
                    CheckpointFailureReason.CHECKPOINT_DECLINED_SUBSUMED));

            // abort the current checkpoint
            releaseBlocksAndResetBarriers();
            checkpointAborted = true;

            // begin a new checkpoint
            beginNewAlignment(barrierId, channelIndex, receivedBarrier.getTimestamp());
        }
        else {
            // ignore trailing barrier from an earlier checkpoint (obsolete now)
            return false;
        }
    }
    else if (barrierId > currentCheckpointId) {
        // first barrier of a new checkpoint
        beginNewAlignment(barrierId, channelIndex, receivedBarrier.getTimestamp());
    }
    else {
        // either the current checkpoint was canceled (numBarriers == 0) or
        // this barrier is from an old subsumed checkpoint
        return false;
    }

    // check if we have all barriers - since canceled checkpoints always have zero barriers
    // this can only happen on a non canceled checkpoint
    if (numBarriersReceived + numClosedChannels == totalNumberOfInputChannels) {
        // actually trigger checkpoint
        if (LOG.isDebugEnabled()) {
            LOG.debug("{}: Received all barriers, triggering checkpoint {} at {}.",
                taskName,
                receivedBarrier.getId(),
                receivedBarrier.getTimestamp());
        }

        releaseBlocksAndResetBarriers();
        notifyCheckpoint(receivedBarrier, bufferedBytes, latestAlignmentDurationNanos);
        return true;
    }
    return checkpointAborted;
}
Since most of this is if-else branching, a brief summary will do. Taking the multi-input case as an example, the deciding condition is:
// check if we have all barriers - since canceled checkpoints always have zero barriers
// this can only happen on a non canceled checkpoint
if (numBarriersReceived + numClosedChannels == totalNumberOfInputChannels) {
    // actually trigger checkpoint
    if (LOG.isDebugEnabled()) {
        LOG.debug("{}: Received all barriers, triggering checkpoint {} at {}.",
            taskName,
            receivedBarrier.getId(),
            receivedBarrier.getTimestamp());
    }

    releaseBlocksAndResetBarriers();
    notifyCheckpoint(receivedBarrier, bufferedBytes, latestAlignmentDurationNanos);
    return true;
}
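That is, once all barriers have arrived (counting closed channels as done), the checkpoint is triggered and the blocked flags set earlier are cleared. Most of the other if-else branches handle exceptional cases, for example receiving a barrier for a newer checkpoint before the current checkpoint has finished. The code certainly says this more clearly than I can.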
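The branching in processBarrier can be condensed into a small state machine. The sketch below is my own simplification, not the Flink class: AlignerSketch and its Outcome enum are hypothetical, it reuses the field names from the listing only for readability, it does not track which channel each barrier came from, and it models the single-channel fast path simply as "the first barrier already completes the set".

```java
/** Condensed model of the aligner's decisions. Hypothetical names; this is not Flink code. */
final class AlignerSketch {
    enum Outcome { STARTED, ALIGNING, TRIGGERED, ABORTED_AND_RESTARTED, IGNORED }

    private final int totalNumberOfInputChannels;
    private long currentCheckpointId = -1L;
    private int numBarriersReceived;
    private int numClosedChannels;   // stays 0 in this sketch; kept to mirror the real condition

    AlignerSketch(int totalNumberOfInputChannels) {
        this.totalNumberOfInputChannels = totalNumberOfInputChannels;
    }

    Outcome processBarrier(long barrierId) {
        Outcome outcome;
        if (numBarriersReceived > 0) {
            // an alignment is already in progress
            if (barrierId == currentCheckpointId) {
                numBarriersReceived++;                 // regular case
                outcome = Outcome.ALIGNING;
            } else if (barrierId > currentCheckpointId) {
                currentCheckpointId = barrierId;       // newer checkpoint subsumes the current one
                numBarriersReceived = 1;
                outcome = Outcome.ABORTED_AND_RESTARTED;
            } else {
                return Outcome.IGNORED;                // trailing barrier of an older checkpoint
            }
        } else if (barrierId > currentCheckpointId) {
            currentCheckpointId = barrierId;           // first barrier of a new checkpoint
            numBarriersReceived = 1;
            outcome = Outcome.STARTED;
        } else {
            return Outcome.IGNORED;                    // canceled or already completed checkpoint
        }

        // all barriers present: trigger the checkpoint and reset the alignment state
        if (numBarriersReceived + numClosedChannels == totalNumberOfInputChannels) {
            numBarriersReceived = 0;
            return Outcome.TRIGGERED;
        }
        return outcome;
    }

    public static void main(String[] args) {
        AlignerSketch aligner = new AlignerSketch(2);
        System.out.println(aligner.processBarrier(1)); // prints STARTED
        System.out.println(aligner.processBarrier(1)); // prints TRIGGERED
        System.out.println(aligner.processBarrier(1)); // prints IGNORED
    }
}
```

Reading the real method with these five outcomes in mind makes it easier to see which branches are the normal path and which are recovery from an overtaken checkpoint.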
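Afterword
Words are pale next to code. You can take the description in the official Flink documentation and compare it against this code to see how Flink aligns barriers and achieves fault tolerance. If you are curious enough to explore the code yourself, you can follow along with this article: the call chain is long, so only the essential parts are excerpted here. I still recommend tracing through the code yourself if you can; when you lose the trail, jump directly to the class names mentioned in the article and continue from there.
Closing
My writing is not great, and I cannot produce the storytelling style found in many viral articles. I write mainly to clear up my own thinking, and also to practice the open-source spirit; after all, sharing knowledge is part of that spirit. If something here is poorly written or could be improved, please leave a comment. I doubt many people read my articles, but if you really did get this far, I hope it helped a little.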
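WeChat official account
WeChat official account: 进击的大数据
It focuses on big data technology. For questions or suggestions, leave a message on the official account. There are more articles there, and if you want to read them I won't stop you, haha.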