Checkpointing comes up frequently across big-data systems, so we will cover it over several chapters.
Let's add the following code to the example program from the previous section:
env.setParallelism(1); // set the parallelism to 1
env.enableCheckpointing(10000); // checkpointing is disabled by default
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE); // the default is EXACTLY_ONCE
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5000); // defaults to 0; the maximum is one year
env.getCheckpointConfig().setCheckpointTimeout(150000); // checkpoint timeout; the default is 10 min
/**
 * Sets the maximum number of checkpoint attempts that may be in progress at the
 * same time. If this value is n, no checkpoint is triggered while n checkpoint
 * attempts are currently in flight. For the next checkpoint to be triggered,
 * one checkpoint attempt has to complete or expire.
 **/
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1); // defaults to 1
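To make the javadoc above concrete, here is a tiny plain-Java sketch of the setMaxConcurrentCheckpoints(n) semantics (the class ConcurrentCheckpointGate is made up for illustration and is not part of Flink): while n attempts are in flight, a new trigger is rejected until one attempt completes or expires.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the setMaxConcurrentCheckpoints(n) semantics; not Flink code.
public class ConcurrentCheckpointGate {
    private final int maxConcurrent;
    private final Set<Long> inFlight = new HashSet<>();

    public ConcurrentCheckpointGate(int maxConcurrent) {
        this.maxConcurrent = maxConcurrent;
    }

    // A checkpoint may only be triggered while fewer than maxConcurrent attempts are in flight.
    public boolean tryTrigger(long checkpointId) {
        if (inFlight.size() >= maxConcurrent) {
            return false; // n attempts already in flight: no new checkpoint is triggered
        }
        return inFlight.add(checkpointId);
    }

    // Completing or expiring an attempt frees a slot for the next trigger.
    public void completeOrExpire(long checkpointId) {
        inFlight.remove(checkpointId);
    }

    public static void main(String[] args) {
        ConcurrentCheckpointGate gate = new ConcurrentCheckpointGate(1);
        System.out.println(gate.tryTrigger(1)); // true: nothing in flight
        System.out.println(gate.tryTrigger(2)); // false: checkpoint 1 still in flight
        gate.completeOrExpire(1);
        System.out.println(gate.tryTrigger(2)); // true: the slot was freed
    }
}
```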
As before, we follow along with the relevant Flink checkpoint documentation: ci.apache.org/projects/fl…
The official checkpoint flow
The official explanation: when every operator in the job graph has received one of these barriers, it records its state. Operators with two input streams (such as a CoProcessFunction) perform barrier alignment, so that the snapshot reflects the state resulting from consuming events from both input streams up to (but not past) both barriers.
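Barrier alignment can be illustrated with a simplified plain-Java model (the BarrierAlignment class below is an illustration, not Flink's actual buffering code): once a barrier has arrived on one input, further records from that input are buffered; when the barrier from the second input arrives, the snapshot is taken and the buffered records are replayed. The snapshot therefore reflects exactly the events consumed before both barriers.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Simplified model of barrier alignment for an operator with two inputs; not Flink code.
public class BarrierAlignment {
    private final List<String> state = new ArrayList<>();      // records folded into operator state
    private final Deque<String> buffered = new ArrayDeque<>(); // records held back during alignment
    private final boolean[] barrierSeen = new boolean[2];
    private List<String> snapshot;

    public void onRecord(int input, String record) {
        if (barrierSeen[input]) {
            buffered.add(record); // this input already saw its barrier: hold it back until alignment
        } else {
            state.add(record);    // pre-barrier record: will be part of the snapshot
        }
    }

    public void onBarrier(int input) {
        barrierSeen[input] = true;
        if (barrierSeen[0] && barrierSeen[1]) {
            snapshot = new ArrayList<>(state); // aligned: snapshot exactly the pre-barrier state
            while (!buffered.isEmpty()) {
                state.add(buffered.poll());    // then replay the records buffered during alignment
            }
            barrierSeen[0] = barrierSeen[1] = false;
        }
    }

    public List<String> getSnapshot() { return snapshot; }
    public List<String> getState() { return state; }

    public static void main(String[] args) {
        BarrierAlignment op = new BarrierAlignment();
        op.onRecord(0, "a1");
        op.onBarrier(0);      // input 0 aligned; its further records are buffered
        op.onRecord(0, "a2");
        op.onRecord(1, "b1"); // input 1 not yet aligned; still part of the snapshot
        op.onBarrier(1);      // both barriers arrived: snapshot taken, then "a2" replayed
        System.out.println("snapshot=" + op.getSnapshot() + ", state=" + op.getState());
    }
}
```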
Open questions:
- Why is there still a checkpoint when there is only one input stream, and if so, how is the checkpoint snapshot produced?
- If a record fails to be written to disk, can checkpointing recover from the failure?
- Does the consistency that checkpointing guarantees refer to state consistency or data consistency?
- What do the "buffers" mentioned here correspond to at the source-code level?
Question 1: why is there still a checkpoint when there is only one input stream, and if so, how is the checkpoint snapshot produced?
How checkpoints are created during task scheduling: www.processon.com/view/link/6…
We start from the JobMaster#createScheduler method and follow the call stack into the methods below.
Once the task scheduling flow starts, execution proceeds in three steps:
1. Create the ExecutionGraph
2. Create a CheckpointCoordinator for each job
3. Try to recover from the most recent checkpoint
SchedulerBase
/**
 *
 * @param log The logging facade to print to.
 * @param jobGraph The JobGraph represents a Flink dataflow program at the low level
 *                 accepted by the JobManager. All programs from higher-level APIs are
 *                 transformed into JobGraphs.
 * @param backPressureStatsTracker Tracker for back-pressure statistics of an {@link ExecutionJobVertex}.
 * @param ioExecutor {@link Executor} used to run I/O tasks; it can produce
 *                   {@link Future}s for tracking the progress of one or more
 *                   asynchronous tasks.
 * @param jobMasterConfiguration Configuration of the {@link JobMaster}.
 * @param slotProvider The slot provider is responsible for preparing slots for tasks
 *                     that are ready to run. It supports two allocation modes:
 *                     immediate allocation, where a slot request is fulfilled right away and
 *                     {@link CompletableFuture#getNow(Object)} returns the allocated slot;
 *                     and queued allocation, where the slot request is queued and the returned
 *                     future completes as soon as a slot becomes available.
 * @param futureExecutor {@link ExecutorService} that can schedule commands to run after a
 *                       given delay or to execute periodically.
 * @param userCodeLoader Class loader for user code; {@link FlinkUserCodeClassLoaders.ResolveOrder#fromString}
 *                       decides whether classes are resolved child-first or parent-first.
 * @param checkpointRecoveryFactory Factory for per-job checkpoint recovery instances.
 * @param rpcTimeout Timeout for RPC calls made by the scheduler.
 * @param restartStrategyFactory Factory for the (possibly custom) restart strategy.
 * @param blobWriter Instance used to write large files (BLOBs).
 * @param jobManagerJobMetricGroup Metric group for this job on the JobManager side.
 * @param slotRequestTimeout Timeout for slot requests issued to each TaskManager.
 * @param shuffleMaster Intermediate result partition registry to be used in the
 *                      {@link org.apache.flink.runtime.jobmaster.JobMaster}.
 * @param partitionTracker Utility that tracks partitions and issues release calls to
 *                         task executors and the shuffle master.
 * @param executionVertexVersioner Records modifications of
 *                                 {@link org.apache.flink.runtime.executiongraph.ExecutionVertex ExecutionVertices}
 *                                 and allows checking whether a vertex was modified.
 * @param legacyScheduling Whether to use the {@link LegacyScheduler}, which delegates
 *                         scheduling logic to the {@link ExecutionGraph}.
 * @throws Exception
 */
public SchedulerBase(
final Logger log,
final JobGraph jobGraph,
final BackPressureStatsTracker backPressureStatsTracker,
final Executor ioExecutor,
final Configuration jobMasterConfiguration,
final SlotProvider slotProvider,
final ScheduledExecutorService futureExecutor,
final ClassLoader userCodeLoader,
final CheckpointRecoveryFactory checkpointRecoveryFactory,
final Time rpcTimeout,
final RestartStrategyFactory restartStrategyFactory,
final BlobWriter blobWriter,
final JobManagerJobMetricGroup jobManagerJobMetricGroup,
final Time slotRequestTimeout,
final ShuffleMaster<?> shuffleMaster,
final JobMasterPartitionTracker partitionTracker,
final ExecutionVertexVersioner executionVertexVersioner,
final boolean legacyScheduling) throws Exception {
this.log = checkNotNull(log);
this.jobGraph = checkNotNull(jobGraph);
this.backPressureStatsTracker = checkNotNull(backPressureStatsTracker);
this.ioExecutor = checkNotNull(ioExecutor);
this.jobMasterConfiguration = checkNotNull(jobMasterConfiguration);
this.slotProvider = checkNotNull(slotProvider);
this.futureExecutor = checkNotNull(futureExecutor);
this.userCodeLoader = checkNotNull(userCodeLoader);
this.checkpointRecoveryFactory = checkNotNull(checkpointRecoveryFactory);
this.rpcTimeout = checkNotNull(rpcTimeout);
..........................
// Create the ExecutionGraph and try to restore from a checkpoint
this.executionGraph = createAndRestoreExecutionGraph(jobManagerJobMetricGroup, checkNotNull(shuffleMaster), checkNotNull(partitionTracker));
.........................
}
private ExecutionGraph createAndRestoreExecutionGraph(
JobManagerJobMetricGroup currentJobManagerJobMetricGroup,
ShuffleMaster<?> shuffleMaster,
JobMasterPartitionTracker partitionTracker) throws Exception {
// 1. Create the ExecutionGraph
ExecutionGraph newExecutionGraph = createExecutionGraph(currentJobManagerJobMetricGroup, shuffleMaster, partitionTracker);
// 2. Obtain the CheckpointCoordinator created for this job
final CheckpointCoordinator checkpointCoordinator = newExecutionGraph.getCheckpointCoordinator();
if (checkpointCoordinator != null) {
// check whether we find a valid checkpoint
// 3. Try to restore from the most recent checkpoint
if (!checkpointCoordinator.restoreLatestCheckpointedState(
new HashSet<>(newExecutionGraph.getAllVertices().values()),
false,
false)) {
// check whether we can restore from a savepoint
tryRestoreExecutionGraphFromSavepoint(newExecutionGraph, jobGraph.getSavepointRestoreSettings());
}
}
return newExecutionGraph;
}
1. Create the ExecutionGraph
SchedulerBase#createExecutionGraph
/**
 *
 * @param currentJobManagerJobMetricGroup Metric group representing everything that belongs
 *                                        to a specific job running on the JobManager.
 * @param shuffleMaster Intermediate result partition registry; its partition descriptors are
 *                      used for producer/consumer deployment and their data exchange.
 * @param partitionTracker Utility that tracks partitions and issues release calls to task
 *                         executors and the shuffle master.
 * @return
 * @throws JobExecutionException
 * @throws JobException
 */
private ExecutionGraph createExecutionGraph(
JobManagerJobMetricGroup currentJobManagerJobMetricGroup,
ShuffleMaster<?> shuffleMaster,
final JobMasterPartitionTracker partitionTracker) throws JobExecutionException, JobException {
// Determine the failover strategy; with the new scheduler a no-op strategy is used
final FailoverStrategy.Factory failoverStrategy = legacyScheduling ?
FailoverStrategyLoader.loadFailoverStrategy(jobMasterConfiguration, log) :
new NoOpFailoverStrategy.Factory();
// Build the ExecutionGraph
return ExecutionGraphBuilder.buildGraph(
null,
jobGraph,
jobMasterConfiguration,
futureExecutor,
ioExecutor,
slotProvider,
userCodeLoader,
checkpointRecoveryFactory,
rpcTimeout,
restartStrategy,
currentJobManagerJobMetricGroup,
blobWriter,
slotRequestTimeout,
log,
shuffleMaster,
partitionTracker,
failoverStrategy);
}
Translating the JobGraph into an ExecutionGraph
1. Generate the operator-graph execution plan from the JobGraph and attach it to the ExecutionGraph.
2. Initialize and register the checkpoint hook functions.
3. Reflectively resolve the JobGraph vertices, i.e. the operator class names; obtain the context class loader of the running job and load the corresponding class for each operator.
4. Topologically sort the job vertices, extract each vertex's edge information from the original Map<JobVertexID, JobVertex>, record the vertex-edge relationships, and attach the sorted job graph to the ExecutionGraph.
5. Configure the snapshot settings used when a job checkpoint is triggered. Following the checkpoint life cycle, this splits into three parts: the vertices that receive the trigger-checkpoint message (CheckpointCoordinator#triggerCheckpoint), the vertices whose checkpoint acknowledgement must be collected (CheckpointCoordinator#receiveAcknowledgeMessage), and the vertices to which checkpoint commits are sent (CheckpointCoordinator#sendAcknowledgeMessages).
6. Determine the maximum number of completed checkpoints to retain, i.e. how many completed checkpoints are available within one retry when a failed task recovers from a checkpoint, and load the completed checkpoints with the job's class loader.
7. Read the state backend settings from the submitted application, load the state backend metadata, and prepare the configuration that checkpoints will later be enabled with.
8. Register the user-defined checkpoint hooks; each hook event holds a double-ended queue.
9. Bind the job's class loader as the thread context class loader, so that hooks are registered on the current thread without clashing with threads that have already registered hooks and are about to run a checkpoint. Once registration finishes, restore the original context class loader and continue with the next hook-registration event.
10. Set up the snapshot settings used before the CheckpointCoordinator takes a snapshot when input arrives.
11. Enable checkpointing.
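Step 6 above, the bounded retention of completed checkpoints, can be sketched as a small bounded store in plain Java (RetainedCheckpoints is a hypothetical stand-in for the real CompletedCheckpointStore), including the fallback to a default of 1 when the configured value is not positive:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical bounded store modeling state.checkpoints.max-retained-checkpoints; not Flink code.
public class RetainedCheckpoints {
    private final int maxRetained;
    private final Deque<Long> completed = new ArrayDeque<>();

    public RetainedCheckpoints(int maxRetained) {
        // mirror buildGraph's fallback: a non-positive setting becomes the default of 1
        this.maxRetained = maxRetained > 0 ? maxRetained : 1;
    }

    public void addCheckpoint(long checkpointId) {
        completed.addLast(checkpointId);
        while (completed.size() > maxRetained) {
            completed.removeFirst(); // discard the oldest retained checkpoint
        }
    }

    public Long getLatestCheckpoint() {
        return completed.peekLast();
    }

    public int size() {
        return completed.size();
    }

    public static void main(String[] args) {
        RetainedCheckpoints store = new RetainedCheckpoints(2);
        store.addCheckpoint(1);
        store.addCheckpoint(2);
        store.addCheckpoint(3);
        System.out.println(store.size() + " retained, latest = " + store.getLatestCheckpoint());
    }
}
```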
/**
 * Translates the JobGraph into an ExecutionGraph.
 * @param prior The ExecutionGraph is the central data structure that coordinates the
 *              distributed execution of a dataflow. It keeps a representation of each
 *              parallel task, each intermediate stream, and the communication between them.
 * @param jobGraph The JobGraph is a graph of vertices and intermediate results that are
 *                 joined together to form a DAG. Note that iterations (feedback edges) are
 *                 currently not encoded inside the JobGraph but inside certain special
 *                 vertices that establish the feedback channel among themselves.
 * @param jobManagerConfig Configuration held by the JobManager.
 * @param futureExecutor Executor that can schedule commands to run after a given delay
 *                       or to execute periodically.
 * @param ioExecutor Object that executes submitted {@link Runnable} tasks. The interface
 *                   decouples task submission from the mechanics of how each task is run,
 *                   including the details of thread use and scheduling.
 * @param slotProvider The slot provider is responsible for preparing slots for tasks that
 *                     are ready to run.
 * @param classLoader Class loader of the current execution thread.
 * @param recoveryFactory Factory for checkpoint recovery.
 * @param rpcTimeout Timeout for RPC calls.
 * @param restartStrategy Restart strategy applied when the job fails.
 * @param metrics Metric group of the job.
 * @param blobWriter Instance used to write large files (BLOBs).
 * @param allocationTimeout Timeout for slot allocation.
 * @param log
 * @param shuffleMaster Intermediate result partition registry; its partition descriptors are
 *                      used for producer/consumer deployment and their data exchange.
 * @param partitionTracker Utility that tracks partitions and issues release calls to task
 *                         executors and the shuffle master.
 * @param failoverStrategyFactory Factory for the failover strategy.
 * @return
 * @throws JobExecutionException
 * @throws JobException
 */
public static ExecutionGraph buildGraph(
@Nullable ExecutionGraph prior,
JobGraph jobGraph,
Configuration jobManagerConfig,
ScheduledExecutorService futureExecutor,
Executor ioExecutor,
SlotProvider slotProvider,
ClassLoader classLoader,
CheckpointRecoveryFactory recoveryFactory,
Time rpcTimeout,
RestartStrategy restartStrategy,
MetricGroup metrics,
BlobWriter blobWriter,
Time allocationTimeout,
Logger log,
ShuffleMaster<?> shuffleMaster,
JobMasterPartitionTracker partitionTracker,
FailoverStrategy.Factory failoverStrategyFactory) throws JobExecutionException, JobException {
checkNotNull(jobGraph, "job graph cannot be null");
final String jobName = jobGraph.getName();
final JobID jobId = jobGraph.getJobID();
final JobInformation jobInformation = new JobInformation(
jobId,
jobName,
jobGraph.getSerializedExecutionConfig(),
jobGraph.getJobConfiguration(),
jobGraph.getUserJarBlobKeys(),
jobGraph.getClasspaths());
final int maxPriorAttemptsHistoryLength =
jobManagerConfig.getInteger(JobManagerOptions.MAX_ATTEMPTS_HISTORY_SIZE);
final PartitionReleaseStrategy.Factory partitionReleaseStrategyFactory =
PartitionReleaseStrategyFactoryLoader.loadPartitionReleaseStrategyFactory(jobManagerConfig);
// create a new execution graph, if none exists so far
final ExecutionGraph executionGraph;
try {
executionGraph = (prior != null) ? prior :
new ExecutionGraph(
jobInformation,
futureExecutor,
ioExecutor,
rpcTimeout,
restartStrategy,
maxPriorAttemptsHistoryLength,
failoverStrategyFactory,
slotProvider,
classLoader,
blobWriter,
allocationTimeout,
partitionReleaseStrategyFactory,
shuffleMaster,
partitionTracker,
jobGraph.getScheduleMode());
} catch (IOException e) {
throw new JobException("Could not create the ExecutionGraph.", e);
}
// set the basic properties
try {
// Generate the JSON execution plan from the JobGraph; when customizing SQL optimization rules, you can write unit tests here to trace the plan transformation
executionGraph.setJsonPlan(JsonPlanGenerator.generatePlan(jobGraph));
}
catch (Throwable t) {
log.warn("Cannot create JSON plan for job", t);
// give the graph an empty plan
executionGraph.setJsonPlan("{}");
}
// initialize the vertices that have a master initialization hook
// file output formats create directories here, input formats create splits
// Initialize and register the master hooks for checkpointing
final long initMasterStart = System.nanoTime();
log.info("Running initialization on master for job {} ({}).", jobName, jobId);
for (JobVertex vertex : jobGraph.getVertices()) {
// Resolve each JobGraph vertex's invokable class name via reflection
String executableClass = vertex.getInvokableClassName();
if (executableClass == null || executableClass.isEmpty()) {
throw new JobSubmissionException(jobId,
"The vertex " + vertex.getID() + " (" + vertex.getName() + ") has no invokable class.");
}
try {
// Initialize each vertex on the master with the job's context class loader
vertex.initializeOnMaster(classLoader);
}
catch (Throwable t) {
throw new JobExecutionException(jobId,
"Cannot initialize task '" + vertex.getName() + "': " + t.getMessage(), t);
}
}
log.info("Successfully ran initialization on master in {} ms.",
(System.nanoTime() - initMasterStart) / 1_000_000);
// topologically sort the job vertices and attach the graph to the existing one
// Topologically sort the job vertices, extract each vertex's edge information from the Map<JobVertexID, JobVertex>, and record the vertex-edge relationships
List<JobVertex> sortedTopology = jobGraph.getVerticesSortedTopologicallyFromSources();
if (log.isDebugEnabled()) {
log.debug("Adding {} vertices from job graph {} ({}).", sortedTopology.size(), jobName, jobId);
}
// Attach the sorted job graph to the ExecutionGraph
executionGraph.attachJobGraph(sortedTopology);
if (log.isDebugEnabled()) {
log.debug("Successfully created execution graph from job graph {} ({}).", jobName, jobId);
}
// configure the state checkpointing
// Configure the snapshot settings used when a job checkpoint is triggered
JobCheckpointingSettings snapshotSettings = jobGraph.getCheckpointingSettings();
if (snapshotSettings != null) {
// Following the checkpoint life cycle, this splits into three parts
// Vertices that receive the trigger-checkpoint message (CheckpointCoordinator#triggerCheckpoint)
List<ExecutionJobVertex> triggerVertices =
idToVertex(snapshotSettings.getVerticesToTrigger(), executionGraph);
// Vertices whose checkpoint acknowledgement must be collected (CheckpointCoordinator#receiveAcknowledgeMessage)
List<ExecutionJobVertex> ackVertices =
idToVertex(snapshotSettings.getVerticesToAcknowledge(), executionGraph);
// Vertices to which checkpoint commits are sent (CheckpointCoordinator#sendAcknowledgeMessages)
List<ExecutionJobVertex> confirmVertices =
idToVertex(snapshotSettings.getVerticesToConfirm(), executionGraph);
// Store of completed checkpoints; calls may block, so make sure accesses do not block the JobManager and run asynchronously
CompletedCheckpointStore completedCheckpoints;
CheckpointIDCounter checkpointIdCounter;
try {
// Maximum number of completed checkpoints to retain, available when a failed task recovers from a checkpoint
int maxNumberOfCheckpointsToRetain = jobManagerConfig.getInteger(
CheckpointingOptions.MAX_RETAINED_CHECKPOINTS);
if (maxNumberOfCheckpointsToRetain <= 0) {
// warning and use 1 as the default value if the setting in
// state.checkpoints.max-retained-checkpoints is not greater than 0.
log.warn("The setting for '{} : {}' is invalid. Using default value of {}",
CheckpointingOptions.MAX_RETAINED_CHECKPOINTS.key(),
maxNumberOfCheckpointsToRetain,
CheckpointingOptions.MAX_RETAINED_CHECKPOINTS.defaultValue());
// If the number of retained completed checkpoints is not greater than 0, fall back to the default of 1
maxNumberOfCheckpointsToRetain = CheckpointingOptions.MAX_RETAINED_CHECKPOINTS.defaultValue();
}
// Load the completed checkpoints with the job's class loader
completedCheckpoints = recoveryFactory.createCheckpointStore(jobId, maxNumberOfCheckpointsToRetain, classLoader);
// Create the checkpoint ID counter for this job
checkpointIdCounter = recoveryFactory.createCheckpointIDCounter(jobId);
}
catch (Exception e) {
throw new JobExecutionException(jobId, "Failed to initialize high-availability checkpoint handler", e);
}
// Maximum number of remembered checkpoints
int historySize = jobManagerConfig.getInteger(WebOptions.CHECKPOINTS_HISTORY_SIZE);
CheckpointStatsTracker checkpointStatsTracker = new CheckpointStatsTracker(
historySize,
ackVertices,
snapshotSettings.getCheckpointCoordinatorConfiguration(),
metrics);
// load the state backend from the application settings
// Read the state backend settings from the submitted application
final StateBackend applicationConfiguredBackend;
final SerializedValue<StateBackend> serializedAppConfigured = snapshotSettings.getDefaultStateBackend();
if (serializedAppConfigured == null) {
applicationConfiguredBackend = null;
}
else {
try {
// Deserialize with the job's context class loader
applicationConfiguredBackend = serializedAppConfigured.deserializeValue(classLoader);
} catch (IOException | ClassNotFoundException e) {
throw new JobExecutionException(jobId,
"Could not deserialize application-defined state backend.", e);
}
}
// Resolve the root state backend that determines where checkpoints are stored
final StateBackend rootBackend;
try {
// Load the state backend from the application settings, the configuration, or the default, preparing to enable checkpoints
rootBackend = StateBackendLoader.fromApplicationOrConfigOrDefault(
applicationConfiguredBackend, jobManagerConfig, classLoader, log);
}
catch (IllegalConfigurationException | IOException | DynamicCodeLoadingException e) {
throw new JobExecutionException(jobId, "Could not instantiate configured state backend", e);
}
// instantiate the user-defined checkpoint hooks
final SerializedValue<MasterTriggerRestoreHook.Factory[]> serializedHooks = snapshotSettings.getMasterHooks();
// Register the user-defined checkpoint hooks; each hook event holds a double-ended queue
final List<MasterTriggerRestoreHook<?>> hooks;
if (serializedHooks == null) {
hooks = Collections.emptyList();
}
else {
final MasterTriggerRestoreHook.Factory[] hookFactories;
try {
hookFactories = serializedHooks.deserializeValue(classLoader);
}
catch (IOException | ClassNotFoundException e) {
throw new JobExecutionException(jobId, "Could not instantiate user-defined checkpoint hooks", e);
}
// Bind the job's class loader as the thread context class loader, so hooks are created on this thread without clashing with threads that already registered hooks and are about to run a checkpoint
final Thread thread = Thread.currentThread();
final ClassLoader originalClassLoader = thread.getContextClassLoader();
thread.setContextClassLoader(classLoader);
try {
hooks = new ArrayList<>(hookFactories.length);
for (MasterTriggerRestoreHook.Factory factory : hookFactories) {
hooks.add(MasterHooks.wrapHook(factory.create(), classLoader));
}
}
finally {
// Once the hooks are registered, restore the original context class loader for the next hook-registration event
thread.setContextClassLoader(originalClassLoader);
}
}
// Snapshot settings applied before the CheckpointCoordinator takes a snapshot when input arrives
final CheckpointCoordinatorConfiguration chkConfig = snapshotSettings.getCheckpointCoordinatorConfiguration();
// This is the crucial call that enables checkpointing
executionGraph.enableCheckpointing(
chkConfig,
triggerVertices,
ackVertices,
confirmVertices,
hooks,
checkpointIdCounter,
completedCheckpoints,
rootBackend,
checkpointStatsTracker);
}
// create all the metrics for the Execution Graph
metrics.gauge(RestartTimeGauge.METRIC_NAME, new RestartTimeGauge(executionGraph));
metrics.gauge(DownTimeGauge.METRIC_NAME, new DownTimeGauge(executionGraph));
metrics.gauge(UpTimeGauge.METRIC_NAME, new UpTimeGauge(executionGraph));
executionGraph.getFailoverStrategy().registerMetrics(metrics);
return executionGraph;
}
Enabling checkpoints
1. Collect, respectively, the vertices that receive the trigger-checkpoint message, the vertices whose checkpoint must be acknowledged, and the vertices to which checkpoints are committed.
2. Build the checkpoint contexts for the operator coordinators.
3. Create a timer for periodic checkpoints, and a CheckpointCoordinator that triggers and commits checkpoints to maintain the corresponding state.
4. Get notified periodically about job status changes; the CheckpointCoordinator registers as an observer (for the observer code see JobManagerJobStatusListener#jobStatusChanges).
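The enableCheckpointing code below also wires up a CheckpointFailureManager with a tolerable-failure count. Its core idea can be sketched in plain Java (FailureCounter is an illustrative stand-in, not the real class): consecutive checkpoint failures are counted, a success resets the streak, and the fail-job callback fires once the count exceeds the tolerable number.

```java
// Illustrative sketch of CheckpointFailureManager's tolerable-failure idea; not the real class.
public class FailureCounter {
    private final int tolerableFailures;
    private final Runnable failJobCallback;
    private int continuousFailures;

    public FailureCounter(int tolerableFailures, Runnable failJobCallback) {
        this.tolerableFailures = tolerableFailures;
        this.failJobCallback = failJobCallback;
    }

    public void handleCheckpointSuccess() {
        continuousFailures = 0; // a successful checkpoint resets the failure streak
    }

    public void handleCheckpointFailure() {
        continuousFailures++;
        if (continuousFailures > tolerableFailures) {
            failJobCallback.run(); // fail the whole job through the callback
        }
    }

    public static void main(String[] args) {
        FailureCounter fm = new FailureCounter(2, () -> System.out.println("job failed"));
        fm.handleCheckpointFailure();
        fm.handleCheckpointFailure();
        fm.handleCheckpointSuccess(); // streak reset: still tolerable
        fm.handleCheckpointFailure();
        fm.handleCheckpointFailure();
        fm.handleCheckpointFailure(); // third consecutive failure: the callback fires
    }
}
```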
ExecutionGraph#enableCheckpointing
/**
 *
 * @param chkConfig Configuration settings for the {@link CheckpointCoordinator}. This includes
 *                  the checkpoint interval, the checkpoint timeout, the pause between
 *                  checkpoints, the maximum number of concurrent checkpoints, and the settings
 *                  for externalized checkpoints.
 * @param verticesToTrigger Vertices that receive the trigger-checkpoint message (CheckpointCoordinator#triggerCheckpoint)
 * @param verticesToWaitFor Vertices whose checkpoint acknowledgement must be collected (CheckpointCoordinator#receiveAcknowledgeMessage)
 * @param verticesToCommitTo Vertices to which checkpoint commits are sent (CheckpointCoordinator#sendAcknowledgeMessages)
 * @param masterHooks The registered checkpoint hooks
 * @param checkpointIDCounter Counter that assigns checkpoint IDs within a job
 * @param checkpointStore Store of completed checkpoints
 * @param checkpointStateBackend State backend that checkpoints are stored in and restored from
 * @param statsTracker Tracker for checkpoint statistics
 */
public void enableCheckpointing(
CheckpointCoordinatorConfiguration chkConfig,
List<ExecutionJobVertex> verticesToTrigger,
List<ExecutionJobVertex> verticesToWaitFor,
List<ExecutionJobVertex> verticesToCommitTo,
List<MasterTriggerRestoreHook<?>> masterHooks,
CheckpointIDCounter checkpointIDCounter,
CompletedCheckpointStore checkpointStore,
StateBackend checkpointStateBackend,
CheckpointStatsTracker statsTracker) {
checkState(state == JobStatus.CREATED, "Job must be in CREATED state");
checkState(checkpointCoordinator == null, "checkpointing already enabled");
// Vertices that receive the trigger-checkpoint message (CheckpointCoordinator#triggerCheckpoint)
ExecutionVertex[] tasksToTrigger = collectExecutionVertices(verticesToTrigger);
// Vertices whose checkpoint acknowledgement must be collected (CheckpointCoordinator#receiveAcknowledgeMessage)
ExecutionVertex[] tasksToWaitFor = collectExecutionVertices(verticesToWaitFor);
// Vertices to which checkpoint commits are sent (CheckpointCoordinator#sendAcknowledgeMessages)
ExecutionVertex[] tasksToCommitTo = collectExecutionVertices(verticesToCommitTo);
// Build the operator coordinator checkpoint contexts
final Collection<OperatorCoordinatorCheckpointContext> operatorCoordinators = buildOpCoordinatorCheckpointContexts();
checkpointStatsTracker = checkNotNull(statsTracker, "CheckpointStatsTracker");
// Checkpoint failure manager
CheckpointFailureManager failureManager = new CheckpointFailureManager(
chkConfig.getTolerableCheckpointFailureNumber(),
new CheckpointFailureManager.FailJobCallback() {
@Override
public void failJob(Throwable cause) {
getJobMasterMainThreadExecutor().execute(() -> failGlobal(cause));
}
@Override
public void failJobDueToTaskFailure(Throwable cause, ExecutionAttemptID failingTask) {
getJobMasterMainThreadExecutor().execute(() -> failGlobalIfExecutionIsStillRunning(cause, failingTask));
}
}
);
checkState(checkpointCoordinatorTimer == null);
// Timer for periodic checkpoints
checkpointCoordinatorTimer = Executors.newSingleThreadScheduledExecutor(
new DispatcherThreadFactory(
Thread.currentThread().getThreadGroup(), "Checkpoint Timer"));
// create the coordinator that triggers and commits checkpoints and holds the state
// Create a CheckpointCoordinator that triggers and commits checkpoints and holds the state
checkpointCoordinator = new CheckpointCoordinator(
jobInformation.getJobId(),
chkConfig,
tasksToTrigger,
tasksToWaitFor,
tasksToCommitTo,
operatorCoordinators,
checkpointIDCounter,
checkpointStore,
checkpointStateBackend,
ioExecutor,
new ScheduledExecutorServiceAdapter(checkpointCoordinatorTimer),
SharedStateRegistry.DEFAULT_FACTORY,
failureManager);
// register the master hooks on the checkpoint coordinator
// Register the checkpoint master hooks
for (MasterTriggerRestoreHook<?> hook : masterHooks) {
if (!checkpointCoordinator.addMasterHook(hook)) {
LOG.warn("Trying to register multiple checkpoint hooks with the name: {}", hook.getIdentifier());
}
}
checkpointCoordinator.setCheckpointStatsTracker(checkpointStatsTracker);
// interval of max long value indicates disable periodic checkpoint,
// the CheckpointActivatorDeactivator should be created only if the interval is not max value
// Notified periodically of job status changes; the CheckpointCoordinator registers as an observer (see JobManagerJobStatusListener#jobStatusChanges)
if (chkConfig.getCheckpointInterval() != Long.MAX_VALUE) {
// the periodic checkpoint scheduler is activated and deactivated as a result of
// job status changes (running -> on, all other states -> off)
registerJobStatusListener(checkpointCoordinator.createActivatorDeactivator());
}
this.stateBackendName = checkpointStateBackend.getClass().getSimpleName();
}
2. Create a CheckpointCoordinator for each job
This was already done in phase one, while creating the ExecutionGraph.
3. Try to recover from the most recent checkpoint
This phase has two parts:
1. Check whether a valid completed checkpoint exists.
2. If no checkpoint can be restored, try to restore from a savepoint.
Checking for a valid completed checkpoint
1. Take a coordinator-wide lock that protects checkpoint updates; with the volatile shutdown flag, check inside the lock whether the coordinator has been shut down, otherwise we get races and invalid error log messages.
2. Create a new shared-state registry object, so that all pending asynchronous disposal requests from previous runs go against the old object (where they can do no harm). This must happen under the checkpoint lock.
3. Recover the checkpoints and, from the store, re-register all shared state with the new registry created in step 2.
4. Restore the most recent checkpoint, gather the operator states belonging to each task, and re-assign them to the tasks.
5. Restore via the registered checkpoint hooks and update the tracked checkpoint statistics.
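The control flow described above condenses to a small decision, sketched here in plain Java (RestoreSketch uses hypothetical types; the real method operates on the CompletedCheckpointStore and StateAssignmentOperation):

```java
// Condensed, hypothetical sketch of restoreLatestCheckpointedState's control flow.
public class RestoreSketch {
    public static boolean restoreLatest(Long latestCheckpointId, boolean errorIfNoCheckpoint) {
        if (latestCheckpointId == null) {
            if (errorIfNoCheckpoint) {
                throw new IllegalStateException("No completed checkpoint available");
            }
            // no restorable state: reset the master hooks and report that nothing was restored
            return false;
        }
        // otherwise: re-assign the operator states of the latest checkpoint to the tasks,
        // call the master hooks for restore, and update the checkpoint statistics
        return true;
    }

    public static void main(String[] args) {
        System.out.println(restoreLatest(42L, false)); // true
        System.out.println(restoreLatest(null, false)); // false
    }
}
```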
/**
 * Restores the latest checkpointed state.
 *
 * @param tasks Set of job vertices to restore. State for these vertices is
 * restored via {@link Execution#setInitialState(JobManagerTaskRestore)}.
 *
 * @param errorIfNoCheckpoint Fail if no completed checkpoint is available to
 * restore from.
 *
 * @param allowNonRestoredState Allow checkpoint state that cannot be mapped
 * to any job vertex in tasks.
 *
 * @return <code>true</code> if state was restored, <code>false</code> otherwise.
 * @throws IllegalStateException If the CheckpointCoordinator is shut down.
 * @throws IllegalStateException If no completed checkpoint is available and
 * the <code>failIfNoCheckpoint</code> flag has been set.
 * @throws IllegalStateException If the checkpoint contains state that cannot be
 * mapped to any job vertex in <code>tasks</code> and the
 * <code>allowNonRestoredState</code> flag has not been set.
 * @throws IllegalStateException If the max parallelism changed for an operator
 * that restores state from this checkpoint.
 * @throws IllegalStateException If the parallelism changed for an operator
 * that restores <i>non-partitioned</i> state from this
 * checkpoint.
 */
public boolean restoreLatestCheckpointedState(
final Set<ExecutionJobVertex> tasks,
final boolean errorIfNoCheckpoint,
final boolean allowNonRestoredState) throws Exception {
// Coordinator-wide lock to protect checkpoint updates
synchronized (lock) {
// we need to check inside the lock for being shutdown as well, otherwise we
// get races and invalid error log messages
if (shutdown) {
throw new IllegalStateException("CheckpointCoordinator is shut down");
}
// We create a new shared state registry object, so that all pending async disposal requests from previous
// runs will go against the old object (were they can do no harm).
// This must happen under the checkpoint lock.
sharedStateRegistry.close();
sharedStateRegistry = sharedStateRegistryFactory.create(executor);
// Recover the checkpoints, TODO this could be done only when there is a new leader, not on each recovery
completedCheckpointStore.recover();
// Now, we re-register all (shared) states from the checkpoint store with the new registry
for (CompletedCheckpoint completedCheckpoint : completedCheckpointStore.getAllCheckpoints()) {
completedCheckpoint.registerSharedStatesAfterRestored(sharedStateRegistry);
}
LOG.debug("Status of the shared state registry of job {} after restore: {}.", job, sharedStateRegistry);
// Restore from the latest checkpoint
CompletedCheckpoint latest = completedCheckpointStore.getLatestCheckpoint(isPreferCheckpointForRecovery);
if (latest == null) {
if (errorIfNoCheckpoint) {
throw new IllegalStateException("No completed checkpoint available");
} else {
LOG.debug("Resetting the master hooks.");
// If there is no state to restore, reset the checkpoint master hooks
MasterHooks.reset(masterHooks.values(), LOG);
return false;
}
}
LOG.info("Restoring job {} from latest valid checkpoint: {}.", job, latest);
// re-assign the task states
// Re-assign each task's state
final Map<OperatorID, OperatorState> operatorStates = latest.getOperatorStates();
StateAssignmentOperation stateAssignmentOperation =
new StateAssignmentOperation(latest.getCheckpointID(), tasks, operatorStates, allowNonRestoredState);
stateAssignmentOperation.assignStates();
// call master hooks for restore
MasterHooks.restoreMasterHooks(
masterHooks,
latest.getMasterHookStates(),
latest.getCheckpointID(),
allowNonRestoredState,
LOG);
// update metrics
// Update the tracked checkpoint statistics
if (statsTracker != null) {
long restoreTimestamp = System.currentTimeMillis();
RestoredCheckpointStats restored = new RestoredCheckpointStats(
latest.getCheckpointID(),
latest.getProperties(),
restoreTimestamp,
latest.getExternalPointer());
statsTracker.reportRestoredCheckpoint(restored);
}
return true;
}
}
If no checkpoint could be restored, try to restore from a savepoint
1. Check whether the current job has savepoint restore settings.
2. Load the completed savepoint from the state backend and reset the checkpoint ID counter.
3. Restore from the latest checkpointed state, which now includes the savepoint.
/**
 * Tries to restore the given {@link ExecutionGraph} from the {@link SavepointRestoreSettings} of the job.
 *
 * @param executionGraphToRestore {@link ExecutionGraph} which is supposed to be restored
 * @param savepointRestoreSettings {@link SavepointRestoreSettings} containing information about the savepoint to restore from
 * @throws Exception if the {@link ExecutionGraph} could not be restored
 */
private void tryRestoreExecutionGraphFromSavepoint(ExecutionGraph executionGraphToRestore, SavepointRestoreSettings savepointRestoreSettings) throws Exception {
// Check whether the job is configured to restore from a savepoint
if (savepointRestoreSettings.restoreSavepoint()) {
final CheckpointCoordinator checkpointCoordinator = executionGraphToRestore.getCheckpointCoordinator();
if (checkpointCoordinator != null) {
checkpointCoordinator.restoreSavepoint(
savepointRestoreSettings.getRestorePath(),
savepointRestoreSettings.allowNonRestoredState(),
executionGraphToRestore.getAllVertices(),
userCodeLoader);
}
}
}
public boolean restoreSavepoint(
String savepointPointer,
boolean allowNonRestored,
Map<JobVertexID, ExecutionJobVertex> tasks,
ClassLoader userClassLoader) throws Exception {
Preconditions.checkNotNull(savepointPointer, "The savepoint path cannot be null.");
LOG.info("Starting job {} from savepoint {} ({})",
job, savepointPointer, (allowNonRestored ? "allowing non restored state" : ""));
final CompletedCheckpointStorageLocation checkpointLocation = checkpointStorage.resolveCheckpoint(savepointPointer);
// Load the completed savepoint from the state backend
CompletedCheckpoint savepoint = Checkpoints.loadAndValidateCheckpoint(
job, tasks, checkpointLocation, userClassLoader, allowNonRestored);
completedCheckpointStore.addCheckpoint(savepoint);
// Reset the checkpoint ID counter past the savepoint's ID
long nextCheckpointId = savepoint.getCheckpointID() + 1;
checkpointIdCounter.setCount(nextCheckpointId);
LOG.info("Reset the checkpoint ID of job {} to {}.", job, nextCheckpointId);
// Restore from the latest checkpointed state, which now includes the savepoint
return restoreLatestCheckpointedState(new HashSet<>(tasks.values()), true, allowNonRestored);
}
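The counter reset in restoreSavepoint can be checked with a tiny plain-Java sketch (IdCounterSketch is an illustrative stand-in for CheckpointIDCounter): after loading a savepoint with ID n, the counter is set to n + 1, so newly triggered checkpoints never reuse the savepoint's ID.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative stand-in for CheckpointIDCounter's role in restoreSavepoint; not Flink code.
public class IdCounterSketch {
    private final AtomicLong counter = new AtomicLong(1);

    public void setCount(long newId) {
        counter.set(newId);
    }

    public long getAndIncrement() {
        return counter.getAndIncrement();
    }

    // Mirrors: long nextCheckpointId = savepoint.getCheckpointID() + 1;
    //          checkpointIdCounter.setCount(nextCheckpointId);
    public void resetAfterSavepoint(long savepointId) {
        setCount(savepointId + 1);
    }

    public static void main(String[] args) {
        IdCounterSketch counter = new IdCounterSketch();
        counter.resetAfterSavepoint(17);
        System.out.println(counter.getAndIncrement()); // 18: IDs continue past the savepoint
    }
}
```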
Having clarified how checkpoints are produced, let's return to the main question: is there still a checkpoint when there is only one input stream, and if so, how is the checkpoint snapshot produced?
The short answer first: "one input stream" is not quite accurate. What actually happens is that when Flink takes a checkpoint, the JobManager triggers every source task (for example Source (Custom Source(1/1)) in the figure below, with parallelism 1, i.e. one partition). Only after the snapshot barrier from every task has arrived does the checkpoint complete.
If you recall step three of the task scheduling flow from the previous article (reflectively initializing the operators), the call stack shows an init method invoked before any messages are sent; that is where the task's checkpoint is triggered.
In the code above, the checkpoint is triggered by SourceStreamTask.super.triggerCheckpointAsync.
private boolean triggerCheckpoint(
CheckpointMetaData checkpointMetaData,
CheckpointOptions checkpointOptions,
boolean advanceToEndOfEventTime) throws Exception {
try {
// No alignment if we inject a checkpoint
CheckpointMetrics checkpointMetrics = new CheckpointMetrics()
.setBytesBufferedInAlignment(0L)
.setAlignmentDurationNanos(0L);
........................................
boolean success = performCheckpoint(checkpointMetaData, checkpointOptions, checkpointMetrics, advanceToEndOfEventTime);
if (!success) {
declineCheckpoint(checkpointMetaData.getCheckpointId());
}
return success;
} catch (Exception e) {
// propagate exceptions only if the task is still in "running" state
.......................................
}
}
}
Triggering a checkpoint from a single task
The number of subtasks of an operator is called its parallelism; Custom Source(1/1) is one such task thread.
Following the code above, expand the performCheckpoint method and you reach the key checkpoint routine, SubtaskCheckpointCoordinator#checkpointState:
1. Prepare the checkpoint, allowing operators to do some pre-barrier work.
2. Send the checkpoint barrier downstream.
3. Prepare to spill the in-flight buffers for input and output.
4. Take the state snapshot asynchronously, so the checkpoint in progress does not stall the job.
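The ordering these four steps impose, barrier out as early as possible and the state snapshot taken asynchronously afterwards, can be sketched in plain Java (CheckpointSteps is purely illustrative, not Flink code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Purely illustrative sketch of the checkpointState step ordering; not Flink code.
public class CheckpointSteps {
    public static List<String> run(boolean unalignedCheckpointEnabled) {
        List<String> order = new ArrayList<>();
        order.add("prepareSnapshotPreBarrier");   // step 1: minimal pre-barrier work
        order.add("broadcastBarrierDownstream");  // step 2: emit the barrier as early as possible
        if (unalignedCheckpointEnabled) {
            order.add("prepareInflightDataSnapshot"); // step 3: only for unaligned checkpoints
        }
        // step 4: the snapshot itself is largely asynchronous so the topology keeps running
        CompletableFuture<String> snapshot = CompletableFuture.supplyAsync(() -> "takeStateSnapshot");
        order.add(snapshot.join());
        return order;
    }

    public static void main(String[] args) {
        System.out.println(run(false));
        System.out.println(run(true));
    }
}
```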
SubtaskCheckpointCoordinator#checkpointState
@Override
public void checkpointState(
CheckpointMetaData metadata,
CheckpointOptions options,
CheckpointMetrics metrics,
OperatorChain<?, ?> operatorChain,
Supplier<Boolean> isCanceled) throws Exception {
checkNotNull(options);
checkNotNull(metrics);
// All of the following steps happen as an atomic step from the perspective of barriers and
// records/watermarks/timers/callbacks.
// We generally try to emit the checkpoint barrier as soon as possible to not affect downstream
// checkpoint alignments
// Step (1): Prepare the checkpoint, allow operators to do some pre-barrier work.
// The pre-barrier work should be nothing or minimal in the common case.
operatorChain.prepareSnapshotPreBarrier(metadata.getCheckpointId());
// Step (2): Send the checkpoint barrier downstream
operatorChain.broadcastEvent(
new CheckpointBarrier(metadata.getCheckpointId(), metadata.getTimestamp(), options),
unalignedCheckpointEnabled);
// Step (3): Prepare to spill the in-flight buffers for input and output
if (unalignedCheckpointEnabled) {
prepareInflightDataSnapshot(metadata.getCheckpointId());
}
// Step (4): Take the state snapshot. This should be largely asynchronous, to not impact progress of the
// streaming topology
Map<OperatorID, OperatorSnapshotFutures> snapshotFutures = new HashMap<>(operatorChain.getNumberOfOperators());
try {
takeSnapshotSync(snapshotFutures, metadata, metrics, options, operatorChain, isCanceled);
finishAndReportAsync(snapshotFutures, metadata, metrics);
} catch (Exception ex) {
// Clean up the snapshot futures on failure
cleanup(snapshotFutures, metadata, metrics, options, ex);
}
}