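This post continues from the previous one, Flink Source Code Reading (2): How Checkpoints Are Produced. Let's revisit the four questions raised there:
- Why is there a checkpoint even with a single input stream, and if so, how is the checkpoint snapshot generated?
- Suppose one record fails to be persisted to disk: can checkpointing support recovery from that failure?
- Is the consistency that checkpointing guarantees the consistency of state, or the consistency of data?
- What do the "buffers" mentioned here refer to at the source-code level?
This post addresses the second question: suppose one record fails to be persisted to disk, can checkpointing support recovery from that failure?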
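From the previous post we already know the basic flow of a single checkpoint:
1. Prepare the checkpoint, allowing operators to do some pre-barrier work.
2. Send the checkpoint barrier downstream.
3. Prepare to spill the in-flight buffers for input and output.
4. Take the state snapshot. This is largely asynchronous so that it does not hold up the ongoing processing.
See also the Flink 1.11 official documentation, "Safe Stream Processing": ci.apache.org/projects/fl…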
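Triggering a checkpoint (Starting Checkpoint)
1. Run some eager pre-checks while holding the CheckpointCoordinator's instance-wide lock, which guards checkpoint updates.
2. Register the checkpoint information with the JobMaster.
3. When the job is cancelled, stop the checkpoints that are still waiting in the scheduler.
4. Track what each operator has completed for the checkpoint. The CheckpointCoordinator tracks completion across the subtasks of an operator chain. The checkpoint registered on the JobMaster is then submitted asynchronously so that its state is synchronized.
5. Generate the checkpoint snapshot.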
CheckpointCoordinator#startTriggeringCheckpoint
private void startTriggeringCheckpoint(
long timestamp,
CheckpointProperties props,
@Nullable String externalSavepointLocation,
boolean isPeriodic,
boolean advanceToEndOfTime,
CompletableFuture<CompletedCheckpoint> onCompletionPromise) {
try {
// make some eager pre-checks
// run the pre-checks under the coordinator-wide lock, which guards checkpoint updates
synchronized (lock) {
preCheckBeforeTriggeringCheckpoint(isPeriodic, props.forceCheckpoint());
}
final Execution[] executions = getTriggerExecutions();
final Map<ExecutionAttemptID, ExecutionVertex> ackTasks = getAckTasks();
// we will actually trigger this checkpoint!
Preconditions.checkState(!isTriggering);
isTriggering = true;
// register the checkpoint information with the JobMaster (this creates the PendingCheckpoint)
final CompletableFuture<PendingCheckpoint> pendingCheckpointCompletableFuture =
initializeCheckpoint(props, externalSavepointLocation)
.thenApplyAsync(
(checkpointIdAndStorageLocation) -> createPendingCheckpoint(
timestamp,
props,
ackTasks,
isPeriodic,
checkpointIdAndStorageLocation.checkpointId,
checkpointIdAndStorageLocation.checkpointStorageLocation,
onCompletionPromise),
timer);
// snapshot the master state (master hooks) once the pending checkpoint exists
final CompletableFuture<?> masterStatesComplete = pendingCheckpointCompletableFuture
.thenCompose(this::snapshotMasterState);
// a CheckpointCoordinator tracks checkpoint completion for every operator in an operator chain,
// so "completed" here means completed by all subtasks of that chain
final CompletableFuture<?> coordinatorCheckpointsComplete = pendingCheckpointCompletableFuture
.thenComposeAsync((pendingCheckpoint) ->
// trigger the coordinator checkpoints and synchronize their progress to the JobMaster
OperatorCoordinatorCheckpoints.triggerAndAcknowledgeAllCoordinatorCheckpointsWithCompletion(
coordinatorsToCheckpoint, pendingCheckpoint, timer),
timer);
// once both futures are done, continue asynchronously with the checkpoint registered on the JobMaster
CompletableFuture.allOf(masterStatesComplete, coordinatorCheckpointsComplete)
.whenCompleteAsync(
(ignored, throwable) -> {
final PendingCheckpoint checkpoint =
FutureUtils.getWithoutException(pendingCheckpointCompletableFuture);
if (throwable == null && checkpoint != null && !checkpoint.isDiscarded()) {
// no exception, no discarding, everything is OK
// generate the checkpoint snapshot by triggering the task state snapshots
snapshotTaskState(
timestamp,
checkpoint.getCheckpointId(),
checkpoint.getCheckpointStorageLocation(),
props,
executions,
advanceToEndOfTime);
onTriggerSuccess();
} else {
// the initialization might not be finished yet
if (checkpoint == null) {
onTriggerFailure(onCompletionPromise, throwable);
} else {
onTriggerFailure(checkpoint, throwable);
}
}
},
timer);
} catch (Throwable throwable) {
onTriggerFailure(onCompletionPromise, throwable);
}
}
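The method above is mostly a chain of asynchronous stages, which is easier to see with the Flink types stripped away. The following is a minimal sketch using only JDK `CompletableFuture`; the class `TriggerPipelineSketch`, the `trigger` method and the returned strings are hypothetical stand-ins, not Flink APIs, and `handleAsync` is used in place of `whenCompleteAsync` only so that the sketch can return a value.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for the trigger pipeline; none of these names are Flink APIs.
final class TriggerPipelineSketch {

    /** Runs the stages in order and returns a description of the outcome. */
    static String trigger(long checkpointId, boolean failMasterState) {
        ExecutorService timer = Executors.newSingleThreadExecutor();
        try {
            // Stage 1: create the "pending checkpoint" asynchronously.
            CompletableFuture<String> pending =
                    CompletableFuture.supplyAsync(() -> "pending-" + checkpointId, timer);

            // Stage 2: snapshot the master state once the pending checkpoint exists.
            CompletableFuture<Void> masterStates = pending.thenCompose(p -> {
                CompletableFuture<Void> f = new CompletableFuture<>();
                if (failMasterState) {
                    f.completeExceptionally(new IllegalStateException("master hook failed"));
                } else {
                    f.complete(null);
                }
                return f;
            });

            // Stage 3: checkpoint the operator coordinators, also after stage 1.
            CompletableFuture<Void> coordinators = pending.thenComposeAsync(
                    p -> CompletableFuture.<Void>completedFuture(null), timer);

            // Stage 4: only when both are done, snapshot the tasks or report the failure.
            return CompletableFuture.allOf(masterStates, coordinators)
                    .handleAsync((ignored, throwable) -> throwable == null
                            ? "snapshotTaskState(" + pending.join() + ")"
                            : "onTriggerFailure", timer)
                    .join();
        } finally {
            timer.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(trigger(7L, false));
        System.out.println(trigger(8L, true));
    }
}
```

The point of the shape is that stages 2 and 3 both depend on stage 1 but not on each other, and the task snapshot is only attempted after `allOf` reports that both finished without an exception.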
Registering the checkpoint information with the JobMaster
CheckpointCoordinator#startTriggeringCheckpoint
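1. First derive the savepoint directory from the checkpoint directory.
2. Submit the work to the scheduler for asynchronous execution and restore every task that registered successfully with the JobMaster, loading the corresponding (previously generated) savepoint for each task.
3. Use the callback result of the task execution to decide whether it succeeded; if it did not, the scheduler retries.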
final CompletableFuture<PendingCheckpoint> pendingCheckpointCompletableFuture =
initializeCheckpoint(props, externalSavepointLocation)
.thenApplyAsync(
(checkpointIdAndStorageLocation) -> createPendingCheckpoint(
timestamp,
props,
ackTasks,
isPeriodic,
checkpointIdAndStorageLocation.checkpointId,
checkpointIdAndStorageLocation.checkpointStorageLocation,
onCompletionPromise),
timer);
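Creating the checkpoint to be submitted to the scheduler
1. Run the pre-checks while holding the CheckpointCoordinator's instance-wide lock, which guards checkpoint updates.
2. Create the PendingCheckpoint, i.e. a checkpoint that has been registered with the JobMaster and submitted to the scheduler but has not completed yet.
3. Report the in-progress checkpoint to the statistics tracker.
4. Schedule a canceller that fires after the checkpoint timeout, and attach its handle to the pending checkpoint.
5. If the checkpoint has already been disposed in the meantime (for example because of an exception), cancel that scheduled canceller.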
private PendingCheckpoint createPendingCheckpoint(
long timestamp,
CheckpointProperties props,
Map<ExecutionAttemptID, ExecutionVertex> ackTasks,
boolean isPeriodic,
long checkpointID,
CheckpointStorageLocation checkpointStorageLocation,
CompletableFuture<CompletedCheckpoint> onCompletionPromise) {
synchronized (lock) {
try {
// the PendingCheckpoint has not been created yet, so we need to check the global state here
preCheckGlobalState(isPeriodic);
} catch (Throwable t) {
throw new CompletionException(t);
}
}
// create the PendingCheckpoint
final PendingCheckpoint checkpoint = new PendingCheckpoint(
job,
checkpointID,
timestamp,
ackTasks,
OperatorCoordinatorCheckpointContext.getIds(coordinatorsToCheckpoint),
masterHooks.keySet(),
props,
checkpointStorageLocation,
executor,
onCompletionPromise);
if (statsTracker != null) {
// report the in-progress checkpoint to the stats tracker
PendingCheckpointStats callback = statsTracker.reportPendingCheckpoint(
checkpointID,
timestamp,
props);
checkpoint.setStatsCallback(callback);
}
synchronized (lock) {
pendingCheckpoints.put(checkpointID, checkpoint);
// schedule the canceller to fire after checkpointTimeout; its handle is attached to the pending checkpoint
ScheduledFuture<?> cancellerHandle = timer.schedule(
new CheckpointCanceller(checkpoint),
checkpointTimeout, TimeUnit.MILLISECONDS);
// if the checkpoint was already disposed (for example after an exception), cancel the scheduled canceller
if (!checkpoint.setCancellerHandle(cancellerHandle)) {
// checkpoint is already disposed!
cancellerHandle.cancel(false);
}
}
LOG.info("Triggering checkpoint {} @ {} for job {}.", checkpointID, timestamp, job);
return checkpoint;
}
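The timeout handling at the end of `createPendingCheckpoint` follows a common pattern: schedule a canceller, try to attach its handle, and cancel the handle if the checkpoint is already gone. Here is a rough sketch of that pattern with a plain `ScheduledExecutorService`. `CancellerSketch`, `Pending`, `markTimedOut` and `armTimeout` are hypothetical names made up for illustration, and the real `PendingCheckpoint` does considerably more.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of the "attach the canceller handle or cancel it" pattern.
final class CancellerSketch {

    /** Stand-in for a pending checkpoint: only a disposal flag and the canceller handle. */
    static final class Pending {
        private boolean disposed;
        private ScheduledFuture<?> cancellerHandle;

        /** Returns false if the checkpoint is already disposed, so the caller cancels the handle. */
        synchronized boolean setCancellerHandle(ScheduledFuture<?> handle) {
            if (disposed) {
                return false;
            }
            cancellerHandle = handle;
            return true;
        }

        /** Called by the timer when the timeout expires. */
        synchronized void markTimedOut() {
            disposed = true;
        }

        /** Called on completion or abort: the timeout task is no longer needed. */
        synchronized void dispose() {
            disposed = true;
            if (cancellerHandle != null) {
                cancellerHandle.cancel(false);
            }
        }
    }

    static ScheduledFuture<?> armTimeout(ScheduledExecutorService timer, Pending p, long timeoutMillis) {
        ScheduledFuture<?> handle =
                timer.schedule(p::markTimedOut, timeoutMillis, TimeUnit.MILLISECONDS);
        if (!p.setCancellerHandle(handle)) {
            handle.cancel(false); // checkpoint is already disposed
        }
        return handle;
    }

    /** Returns {live handle cancelled before dispose, after dispose, handle of a disposed checkpoint}. */
    static boolean[] demo() {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        try {
            Pending live = new Pending();
            ScheduledFuture<?> h1 = armTimeout(timer, live, 60_000);
            boolean before = h1.isCancelled();
            live.dispose();
            boolean after = h1.isCancelled();

            Pending dead = new Pending();
            dead.dispose();
            ScheduledFuture<?> h2 = armTimeout(timer, dead, 60_000);
            return new boolean[] {before, after, h2.isCancelled()};
        } finally {
            timer.shutdownNow();
        }
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println(r[0] + " " + r[1] + " " + r[2]);
    }
}
```

Either way, no timeout task is left behind for a checkpoint that no longer exists.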
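So what happens if an exception occurs while the task state is being updated periodically?
1. If checkpointed state exists, reload it into the executions.
2. Abort the pending checkpoints so that new checkpoints can be triggered without waiting for the last one to expire, and so that EXACTLY_ONCE semantics are preserved where required.
3. Get the affected operators from the ExecutionGraph and collect the tasks to restore for each of them (for example Custom Source (1/2)).
4. Try to restore each task by loading the latest checkpointed state.
SchedulerBase#restoreState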
protected void restoreState(final Set<ExecutionVertexID> vertices) throws Exception {
// if there is checkpointed state, reload it into the executions
if (executionGraph.getCheckpointCoordinator() != null) {
// abort pending checkpoints to
// i) enable new checkpoint triggering without waiting for the last checkpoint to expire.
// ii) ensure the EXACTLY_ONCE semantics if needed.
executionGraph.getCheckpointCoordinator().abortPendingCheckpoints(
new CheckpointException(CheckpointFailureReason.JOB_FAILOVER_REGION));
executionGraph.getCheckpointCoordinator().restoreLatestCheckpointedState(
getInvolvedExecutionJobVertices(vertices),
false,
true);
}
}
Enabling new checkpoint triggering: aborting every pending checkpoint because of the exception
CheckpointCoordinator#abortPendingCheckpoints
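1. Assert that the current thread holds the coordinator lock.
2. A PendingCheckpoint that was submitted to the scheduler but never reported as complete falls into one of three cases:
1) It is a savepoint taken while suspending/terminating the job (a synchronous savepoint).
2) The job's JobGraph has already been translated into an ExecutionGraph and the failing execution attempt is known (executionAttemptID != null): the failure is handled at task level, with the checkpoints processed according to their failure counts.
3) Otherwise the JobGraph has not been translated into an ExecutionGraph yet: the failure is handled at job level, and the timer inside the CheckpointCoordinator keeps triggering checkpoints periodically.
3. Whatever the outcome of that handling, the checkpoint is removed from the pending checkpoints in the finally block.
4. The ID of the aborted checkpoint is remembered, and periodic triggering is resumed so that this thread finishes its work before the next checkpoint starts.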
/**
* Aborts all pending checkpoints because of an exception.
* @param exception The exception.
*/
public void abortPendingCheckpoints(CheckpointException exception) {
synchronized (lock) {
abortPendingCheckpoints(ignored -> true, exception);
}
}
private void abortPendingCheckpoint(
PendingCheckpoint pendingCheckpoint,
CheckpointException exception,
@Nullable final ExecutionAttemptID executionAttemptID) {
// the current thread must hold the coordinator lock
assert(Thread.holdsLock(lock));
if (!pendingCheckpoint.isDiscarded()) {
try {
// abort the PendingCheckpoint that failed with the exception and release its resources
pendingCheckpoint.abort(
exception.getCheckpointFailureReason(), exception.getCause());
// case 1: the PendingCheckpoint is a savepoint taken while suspending/terminating the job
// (see CheckpointType: both isSavepoint and isSynchronous are true)
if (pendingCheckpoint.getProps().isSavepoint() &&
pendingCheckpoint.getProps().isSynchronous()) {
// notify the JobMaster that this savepoint has failed
failureManager.handleSynchronousSavepointFailure(exception);
} else if (executionAttemptID != null) {
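// case 2: the JobGraph has already been translated into an ExecutionGraph and the failing attempt is known,
// so the failure is handled at task level; failure counts were recorded per checkpoint when tasks were
// restored, and the checkpoints are processed according to those counts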
failureManager.handleTaskLevelCheckpointException(
exception, pendingCheckpoint.getCheckpointId(), executionAttemptID);
} else {
// case 3: the JobGraph has not been translated into an ExecutionGraph yet, i.e. the job failed during translation;
// the timer inside the CheckpointCoordinator keeps triggering checkpoints periodically
failureManager.handleJobLevelCheckpointException(
exception, pendingCheckpoint.getCheckpointId());
}
} finally {
// whatever the handling above decided, remove this checkpoint from the pending checkpoints
pendingCheckpoints.remove(pendingCheckpoint.getCheckpointId());
// remember the ID of the aborted checkpoint
rememberRecentCheckpointId(pendingCheckpoint.getCheckpointId());
// resume periodic triggering so this thread's work is finished before the next checkpoint starts
resumePeriodicTriggering();
}
}
}
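The control flow of `abortPendingCheckpoint` boils down to a three-way dispatch followed by unconditional cleanup. A stripped-down sketch of just that shape might look as follows. Everything in it (`AbortDispatchSketch`, the `Handling` enum, the `String` attempt ID) is a hypothetical simplification, and the real failure manager calls are replaced by returning an enum value.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical simplification: which failure path is taken, plus the cleanup in finally.
final class AbortDispatchSketch {

    enum Handling { SYNCHRONOUS_SAVEPOINT, TASK_LEVEL, JOB_LEVEL }

    private final Map<Long, String> pendingCheckpoints = new HashMap<>();
    private long lastAbortedId = -1L;

    void register(long checkpointId) {
        pendingCheckpoints.put(checkpointId, "pending");
    }

    /** attemptId == null models "no specific execution attempt is known". */
    Handling abort(long checkpointId, boolean synchronousSavepoint, String attemptId) {
        try {
            if (synchronousSavepoint) {
                return Handling.SYNCHRONOUS_SAVEPOINT;
            } else if (attemptId != null) {
                return Handling.TASK_LEVEL;
            } else {
                return Handling.JOB_LEVEL;
            }
        } finally {
            // Cleanup runs on every path, mirroring the finally block in the listing above.
            pendingCheckpoints.remove(checkpointId);
            lastAbortedId = checkpointId;
        }
    }

    int pendingCount() {
        return pendingCheckpoints.size();
    }

    long lastAborted() {
        return lastAbortedId;
    }

    public static void main(String[] args) {
        AbortDispatchSketch sketch = new AbortDispatchSketch();
        sketch.register(1L);
        sketch.register(2L);
        System.out.println(sketch.abort(1L, true, null));
        System.out.println(sketch.abort(2L, false, "attempt-1"));
        System.out.println(sketch.pendingCount());
    }
}
```

The detail worth noticing is that the removal does not depend on how the failure was classified: a checkpoint that is being aborted always leaves the pending map.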
Collecting the tasks to restore for each operator (for example Custom Source (1/2)) and trying to restore each task from the latest checkpointed state
CheckpointCoordinator#restoreLatestCheckpointedState
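1. Take the coordinator-wide lock that guards checkpoint updates. Inside the lock, check the volatile shutdown flag; checking it outside the lock would lead to races and invalid error log messages.
2. Create a new shared state registry object, so that all pending asynchronous disposal requests from previous runs go against the old object (where they can do no harm). This must happen under the checkpoint lock.
3. Recover the checkpoints, then re-register all shared state from the checkpoint store with the new registry created in step 2.
4. Restore from the latest checkpoint: collect the operator state that belongs to each task and reassign it to the tasks.
5. Call the registered master hooks for restore, and update the tracked checkpoint statistics.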
/**
* Restores the latest checkpointed state.
*
* @param tasks Set of job vertices to restore. State for these vertices is
* restored via {@link Execution#setInitialState(JobManagerTaskRestore)}.
*
* @param errorIfNoCheckpoint Fail if no completed checkpoint is available to
* restore from.
*
* @param allowNonRestoredState Allow checkpoint state that cannot be mapped
* to any job vertex in tasks.
*
* @return <code>true</code> if state was restored, <code>false</code> otherwise.
* @throws IllegalStateException If the CheckpointCoordinator is shut down.
* @throws IllegalStateException If no completed checkpoint is available and
* the <code>failIfNoCheckpoint</code> flag has been set.
* @throws IllegalStateException If the checkpoint contains state that cannot be
* mapped to any job vertex in <code>tasks</code> and the
* <code>allowNonRestoredState</code> flag has not been set.
* @throws IllegalStateException If the max parallelism changed for an operator
* that restores state from this checkpoint.
* @throws IllegalStateException If the parallelism changed for an operator
* that restores <i>non-partitioned</i> state from this
* checkpoint.
*/
public boolean restoreLatestCheckpointedState(
final Set<ExecutionJobVertex> tasks,
final boolean errorIfNoCheckpoint,
final boolean allowNonRestoredState) throws Exception {
// coordinator-wide lock that guards checkpoint updates
synchronized (lock) {
// we need to check inside the lock for being shutdown as well, otherwise we
// get races and invalid error log messages
if (shutdown) {
throw new IllegalStateException("CheckpointCoordinator is shut down");
}
// We create a new shared state registry object, so that all pending async disposal requests from previous
// runs will go against the old object (where they can do no harm).
// This must happen under the checkpoint lock.
sharedStateRegistry.close();
sharedStateRegistry = sharedStateRegistryFactory.create(executor);
// Recover the checkpoints, TODO this could be done only when there is a new leader, not on each recovery
// (this is the recovery that runs after a checkpoint failure)
completedCheckpointStore.recover();
// Now, we re-register all (shared) states from the checkpoint store with the new registry
for (CompletedCheckpoint completedCheckpoint : completedCheckpointStore.getAllCheckpoints()) {
completedCheckpoint.registerSharedStatesAfterRestored(sharedStateRegistry);
}
LOG.debug("Status of the shared state registry of job {} after restore: {}.", job, sharedStateRegistry);
// Restore from the latest checkpoint
CompletedCheckpoint latest = completedCheckpointStore.getLatestCheckpoint(isPreferCheckpointForRecovery);
if (latest == null) {
if (errorIfNoCheckpoint) {
throw new IllegalStateException("No completed checkpoint available");
} else {
LOG.debug("Resetting the master hooks.");
// there is no state to restore, so reset the master hooks
MasterHooks.reset(masterHooks.values(), LOG);
return false;
}
}
LOG.info("Restoring job {} from latest valid checkpoint: {}.", job, latest);
// re-assign the task states
final Map<OperatorID, OperatorState> operatorStates = latest.getOperatorStates();
StateAssignmentOperation stateAssignmentOperation =
new StateAssignmentOperation(latest.getCheckpointID(), tasks, operatorStates, allowNonRestoredState);
stateAssignmentOperation.assignStates();
// call master hooks for restore
MasterHooks.restoreMasterHooks(
masterHooks,
latest.getMasterHookStates(),
latest.getCheckpointID(),
allowNonRestoredState,
LOG);
// update metrics (the tracked checkpoint statistics)
if (statsTracker != null) {
long restoreTimestamp = System.currentTimeMillis();
RestoredCheckpointStats restored = new RestoredCheckpointStats(
latest.getCheckpointID(),
latest.getProperties(),
restoreTimestamp,
latest.getExternalPointer());
statsTracker.reportRestoredCheckpoint(restored);
}
return true;
}
}
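Setting the locking and the shared state registry aside, the decision logic of the restore is short: pick the latest completed checkpoint, fail or return `false` if there is none, then hand each operator's state to the matching task. The sketch below models only that logic with plain collections. `RestoreSketch`, `Completed` and the `String` state values are hypothetical, and the real `StateAssignmentOperation` also deals with parallelism changes and state redistribution, which is ignored here.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical model of "restore from the latest completed checkpoint".
final class RestoreSketch {

    /** A completed checkpoint: an ID plus one state value per operator ID. */
    static final class Completed {
        final long id;
        final Map<String, String> operatorStates;

        Completed(long id, Map<String, String> operatorStates) {
            this.id = id;
            this.operatorStates = operatorStates;
        }
    }

    /** Returns the state assigned per task, or empty if there was nothing to restore. */
    static Optional<Map<String, String>> restoreLatest(
            List<Completed> store,
            Set<String> tasks,
            boolean errorIfNoCheckpoint,
            boolean allowNonRestoredState) {
        Optional<Completed> latest =
                store.stream().max(Comparator.comparingLong((Completed c) -> c.id));
        if (!latest.isPresent()) {
            if (errorIfNoCheckpoint) {
                throw new IllegalStateException("No completed checkpoint available");
            }
            return Optional.empty();
        }
        Map<String, String> assigned = new HashMap<>();
        for (Map.Entry<String, String> e : latest.get().operatorStates.entrySet()) {
            if (tasks.contains(e.getKey())) {
                assigned.put(e.getKey(), e.getValue());
            } else if (!allowNonRestoredState) {
                throw new IllegalStateException(
                        "State of " + e.getKey() + " cannot be mapped to any task");
            }
        }
        return Optional.of(assigned);
    }

    /** Two checkpoints; the later one also holds state for an operator ("sink") with no task. */
    static List<Completed> sampleStore() {
        Map<String, String> older = new HashMap<>();
        older.put("source", "s3");
        Map<String, String> newer = new HashMap<>();
        newer.put("source", "s5");
        newer.put("sink", "k5");
        return Arrays.asList(new Completed(3L, older), new Completed(5L, newer));
    }

    public static void main(String[] args) {
        Set<String> tasks = new HashSet<>(Arrays.asList("source", "map"));
        System.out.println(restoreLatest(sampleStore(), tasks, false, true));
    }
}
```

This also shows why `restoreState` passes `false` and `true` for the two flags: a failover should neither fail because no checkpoint exists yet nor reject state that maps to vertices outside the restarted set.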
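The pending checkpoint (Pending Checkpoint)
Tracking what has been completed for the checkpoint: the master hook state
CheckpointCoordinator#snapshotMasterState
1. Trigger every checkpoint hook registered with the JobMaster.
2. Handle the result of each hook asynchronously for the PendingCheckpoint:
1) If the checkpoint has already been discarded, throw an exception.
2) If no exception occurred while synchronizing the hook's state to the JobMaster, acknowledge it on the checkpoint.
3) Once all master states have been acknowledged, mark this round as complete.
3. Handle any exception that occurs along the way by completing the future exceptionally.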
/**
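* Snapshots the master hook states asynchronously.
*
* @param checkpoint the pending checkpoint: already submitted to the scheduler, but not yet completed
* @return a future that represents the completion of all checkpoint hooks registered with the JobMaster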
*/
private CompletableFuture<Void> snapshotMasterState(PendingCheckpoint checkpoint) {
if (masterHooks.isEmpty()) {
return CompletableFuture.completedFuture(null);
}
final long checkpointID = checkpoint.getCheckpointId();
final long timestamp = checkpoint.getCheckpointTimestamp();
final CompletableFuture<Void> masterStateCompletableFuture = new CompletableFuture<>();
for (MasterTriggerRestoreHook<?> masterHook : masterHooks.values()) {
// trigger the checkpoint hook registered with the JobMaster
MasterHooks
.triggerHook(masterHook, checkpointID, timestamp, executor)
.whenCompleteAsync(
(masterState, throwable) -> {
try {
synchronized (lock) {
if (masterStateCompletableFuture.isDone()) {
return;
}
// the checkpoint has already been discarded
if (checkpoint.isDiscarded()) {
throw new IllegalStateException(
"Checkpoint " + checkpointID + " has been discarded");
}
if (throwable == null) {
// no exception while synchronizing the hook's state to the JobMaster: acknowledge it on the checkpoint
checkpoint.acknowledgeMasterState(
masterHook.getIdentifier(), masterState);
if (checkpoint.areMasterStatesFullyAcknowledged()) {
// every master state has been acknowledged, so this round is complete
masterStateCompletableFuture.complete(null);
}
} else {
// the hook failed: complete the future exceptionally
masterStateCompletableFuture.completeExceptionally(throwable);
}
}
} catch (Throwable t) {
// anything thrown while handling the result also fails the future
masterStateCompletableFuture.completeExceptionally(t);
}
},
timer);
}
return masterStateCompletableFuture;
}
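The interesting part of `snapshotMasterState` is the hand-rolled "complete once everyone has acknowledged" future. The following is a minimal sketch of that idea with JDK classes only. `MasterStateSketch` and its `snapshot` method are hypothetical, the hook results are plain `CompletableFuture<String>` values keyed by a hook identifier, and the discarded-checkpoint check from the real method is left out.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch: a future that completes when every hook has acknowledged.
final class MasterStateSketch {

    static CompletableFuture<Void> snapshot(Map<String, CompletableFuture<String>> hooks) {
        if (hooks.isEmpty()) {
            return CompletableFuture.completedFuture(null);
        }
        final CompletableFuture<Void> all = new CompletableFuture<>();
        final Set<String> notYetAcknowledged = new HashSet<>(hooks.keySet());
        final Object lock = new Object();

        hooks.forEach((id, hookFuture) -> hookFuture.whenComplete((state, throwable) -> {
            synchronized (lock) {
                if (all.isDone()) {
                    return; // an earlier hook already failed or finished the round
                }
                if (throwable != null) {
                    all.completeExceptionally(throwable);
                    return;
                }
                notYetAcknowledged.remove(id);
                if (notYetAcknowledged.isEmpty()) {
                    all.complete(null);
                }
            }
        }));
        return all;
    }

    public static void main(String[] args) {
        Map<String, CompletableFuture<String>> hooks = new LinkedHashMap<>();
        CompletableFuture<String> first = new CompletableFuture<>();
        CompletableFuture<String> second = new CompletableFuture<>();
        hooks.put("hook-a", first);
        hooks.put("hook-b", second);

        CompletableFuture<Void> all = snapshot(hooks);
        first.complete("state-a");
        System.out.println(all.isDone());
        second.complete("state-b");
        System.out.println(all.isDone());
    }
}
```

A single failing hook fails the whole round immediately, while success requires every hook to report back, which matches the asymmetric handling in the listing above.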
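Tracking checkpoint completion across all subtasks of each operator chain (a CheckpointCoordinator tracks checkpoint completion for every operator in a chain, so "completed" here means completed by all subtasks of that chain)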
OperatorCoordinatorCheckpoints#triggerAndAcknowledgeAllCoordinatorCheckpointsWithCompletion
public static CompletableFuture<AllCoordinatorSnapshots> triggerAllCoordinatorCheckpoints(
final Collection<OperatorCoordinatorCheckpointContext> coordinators,
final long checkpointId) throws Exception {
// one snapshot future per operator coordinator
final Collection<CompletableFuture<CoordinatorSnapshot>> individualSnapshots = new ArrayList<>(coordinators.size());
// trigger the checkpoint on each operator's coordinator and keep the resulting future
for (final OperatorCoordinatorCheckpointContext coordinator : coordinators) {
individualSnapshots.add(triggerCoordinatorCheckpoint(coordinator, checkpointId));
}
// combine the per-coordinator snapshots asynchronously into a single result for the whole set
return FutureUtils.combineAll(individualSnapshots).thenApply(AllCoordinatorSnapshots::new);
}
The key question here is how each operator coordinator completes a checkpoint. How is that implemented?
OperatorCoordinatorCheckpoints#triggerCoordinatorCheckpoint
public static CompletableFuture<CoordinatorSnapshot> triggerCoordinatorCheckpoint(
final OperatorCoordinatorCheckpointContext coordinatorInfo,
final long checkpointId) throws Exception {
// ask this operator coordinator to checkpoint itself; the future yields its serialized state
final CompletableFuture<byte[]> checkpointFuture =
coordinatorInfo.coordinator().checkpointCoordinator(checkpointId);
// wrap the bytes into the snapshot of this operator coordinator
return checkpointFuture.thenApply(
(state) -> new CoordinatorSnapshot(
coordinatorInfo, new ByteStreamStateHandle(coordinatorInfo.operatorId().toString(), state))
);
}
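Both methods above rely on the "fan out, then combine" idiom. A rough plain-JDK approximation of the combining step is sketched here; it is not Flink's `FutureUtils.combineAll` implementation, and `CombineSketch` with its `combineAll` method is a hypothetical name used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical approximation of combining many futures into one future of a list.
final class CombineSketch {

    static <T> CompletableFuture<List<T>> combineAll(List<CompletableFuture<T>> futures) {
        return CompletableFuture
                .allOf(futures.toArray(new CompletableFuture<?>[0]))
                .thenApply(ignored -> {
                    List<T> results = new ArrayList<>(futures.size());
                    for (CompletableFuture<T> future : futures) {
                        results.add(future.join()); // already complete, does not block
                    }
                    return results;
                });
    }

    public static void main(String[] args) {
        CompletableFuture<String> sourceCoordinator = CompletableFuture.supplyAsync(() -> "source-state");
        CompletableFuture<String> sinkCoordinator = CompletableFuture.supplyAsync(() -> "sink-state");
        System.out.println(combineAll(Arrays.asList(sourceCoordinator, sinkCoordinator)).join());
    }
}
```

If any single future fails, the combined future fails as well, so one operator coordinator that cannot snapshot itself is enough to fail the whole step.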
Completing the checkpoint (Completed Checkpoint)
Asynchronously submitting the checkpoint registered on the JobMaster and synchronizing its state
CheckpointCoordinator#snapshotTaskState
/**
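* Snapshots the state of every task, i.e. takes the checkpoint.
*
* @param timestamp the timestamp of this checkpoint request
* @param checkpointID the ID of this checkpoint
* @param checkpointStorageLocation the storage location of this checkpoint
* @param props the checkpoint properties
* @param executions the executions that should be triggered
* @param advanceToEndOfTime flag indicating whether the sources should inject a {@code MAX_WATERMARK} into the pipeline to fire any registered event-time timers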
*/
private void snapshotTaskState(
long timestamp,
long checkpointID,
CheckpointStorageLocation checkpointStorageLocation,
CheckpointProperties props,
Execution[] executions,
boolean advanceToEndOfTime) {
final CheckpointOptions checkpointOptions = new CheckpointOptions(
props.getCheckpointType(),
checkpointStorageLocation.getLocationReference());
// send the messages to the tasks that trigger their checkpoint
for (Execution execution: executions) {
// synchronous savepoints take a separate path, which can also advance the watermark (advanceToEndOfTime)
if (props.isSynchronous()) {
// trigger a synchronous savepoint
execution.triggerSynchronousSavepoint(checkpointID, timestamp, checkpointOptions, advanceToEndOfTime);
} else {
// trigger a regular checkpoint
execution.triggerCheckpoint(checkpointID, timestamp, checkpointOptions);
}
}
}
Expanding both methods shows that they call the same shared helper:
Execution#triggerCheckpointHelper
private LogicalSlot assignedResource;
private void triggerCheckpointHelper(long checkpointId, long timestamp, CheckpointOptions checkpointOptions, boolean advanceToEndOfEventTime) {
final CheckpointType checkpointType = checkpointOptions.getCheckpointType();
if (advanceToEndOfEventTime && !(checkpointType.isSynchronous() && checkpointType.isSavepoint())) {
throw new IllegalArgumentException("Only synchronous savepoints are allowed to advance the watermark to MAX.");
}
// the slot assigned to this execution (slots are covered later, in the post on parallelism)
final LogicalSlot slot = assignedResource;
if (slot != null) {
final TaskManagerGateway taskManagerGateway = slot.getTaskManagerGateway();
// trigger the checkpoint on the TaskManager that hosts the assigned slot
taskManagerGateway.triggerCheckpoint(attemptId, getVertex().getJobId(), checkpointId, timestamp, checkpointOptions, advanceToEndOfEventTime);
} else {
LOG.debug("The execution has no slot assigned. This indicates that the execution is no longer running.");
}
}
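The first post in this series, Flink Source Code Reading (1): The Runtime Mechanism, already covered the relevant architecture.
On the TaskManager side, every task that receives the trigger request produces its own checkpoint: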
@Override
public CompletableFuture<Acknowledge> triggerCheckpoint(
ExecutionAttemptID executionAttemptID,
long checkpointId,
long checkpointTimestamp,
CheckpointOptions checkpointOptions,
boolean advanceToEndOfEventTime) {
log.debug("Trigger checkpoint {}@{} for {}.", checkpointId, checkpointTimestamp, executionAttemptID);
final CheckpointType checkpointType = checkpointOptions.getCheckpointType();
// only a synchronous savepoint may advance the watermark to MAX; any other combination is rejected
if (advanceToEndOfEventTime && !(checkpointType.isSynchronous() && checkpointType.isSavepoint())) {
throw new IllegalArgumentException("Only synchronous savepoints are allowed to advance the watermark to MAX.");
}
// the execution was given a slot earlier, so the task slot table should hold a task for this attempt ID
final Task task = taskSlotTable.getTask(executionAttemptID);
if (task != null) {
// each task emits a checkpoint barrier; the second post in this series walks through that flow
task.triggerCheckpointBarrier(checkpointId, checkpointTimestamp, checkpointOptions, advanceToEndOfEventTime);
return CompletableFuture.completedFuture(Acknowledge.get());
} else {
final String message = "TaskManager received a checkpoint request for unknown task " + executionAttemptID + '.';
log.debug(message);
return FutureUtils.completedExceptionally(new CheckpointException(message, CheckpointFailureReason.TASK_CHECKPOINT_FAILURE));
}
}