Production issue
Because the production Spark cluster consists entirely of out-of-warranty machines, individual nodes fail from time to time, so jobs frequently go through failure retries. The following walks through the failure-retry flow and the retry logic from the source code.
Flow
Referring to the map-task result delivery section of juejin.cn/post/719771… , after a ShuffleMapTask finishes, the executor sends a StatusUpdate back to the driver, so the analysis starts from TaskSchedulerImpl.statusUpdate().
1. The failure result is returned to the TaskSetManager for handling
TaskSchedulerImpl.statusUpdate():
  if (Set(TaskState.FAILED, TaskState.KILLED, TaskState.LOST).contains(state)) {
    taskResultGetter.enqueueFailedTask(taskSet, tid, state, serializedData)
  }
TaskResultGetter.enqueueFailedTask():
  scheduler.handleFailedTask(taskSetManager, tid, taskState, reason)
TaskSchedulerImpl.handleFailedTask():
  taskSetManager.handleFailedTask(tid, taskState, reason)
2. The TaskSetManager handles the failed task
- For FetchFailed: isZombie = true, and the TaskSetManager enters the zombie state
case fetchFailed: FetchFailed =>
logWarning(failureReason)
if (!successful(index)) {
successful(index) = true
tasksSuccessful += 1
}
isZombie = true
if (fetchFailed.bmAddress != null) {
blacklistTracker.foreach(_.updateBlacklistForFetchFailure(
fetchFailed.bmAddress.host, fetchFailed.bmAddress.executorId))
}
None
- The task end status is then sent to the DAGScheduler:
sched.dagScheduler.taskEnded(tasks(index), reason, null, accumUpdates, metricPeaks, info)
- Based on the task's failure count, decide whether the stage has failed; the maximum number of task failures (spark.task.maxFailures) can be configured in the job-submission shell script (see the config sketch after this code block). If the blacklist feature is enabled, the failing node is also added to the blacklist:
if (!isZombie && reason.countTowardsTaskFailures) {
assert (null != failureReason)
taskSetBlacklistHelperOpt.foreach(_.updateBlacklistForFailedTask(
info.host, info.executorId, index, failureReason))
numFailures(index) += 1
if (numFailures(index) >= maxTaskFailures) {
logError("Task %d in stage %s failed %d times; aborting job".format(
index, taskSet.id, maxTaskFailures))
abort("Task %d in stage %s failed %d times, most recent failure: %s\nDriver stacktrace:"
.format(index, taskSet.id, maxTaskFailures, failureReason), failureException)
return
}
}
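As a hedged sketch (not part of the original walkthrough; values are illustrative), the two retry limits involved here can be set on the SparkConf, and the same keys can equally be passed as --conf options in the submission shell script:

// Hedged sketch: task- and stage-level retry limits; values are examples only.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .setAppName("retry-config-example")             // illustrative application name
  .set("spark.task.maxFailures", "8")             // the maxTaskFailures checked in the code above
  .set("spark.stage.maxConsecutiveAttempts", "4") // stage-level retry limit used by the DAGScheduler
val spark = SparkSession.builder().config(conf).getOrCreate()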
3. The DAGScheduler handles the failed task
- If the stage's failed attempt count reaches spark.stage.maxConsecutiveAttempts, the stage is not retried again and the job is aborted.
- Otherwise, the failedStage and the corresponding mapStage are added to failedStages.
- For an INDETERMINATE stage, the stages that need to be rolled back (i.e. all downstream stages) are computed and recomputed as well.
- Finally, a ResubmitFailedStages event is scheduled:
case FetchFailed(bmAddress, shuffleId, _, mapIndex, _, failureMessage) =>
val failedStage = stageIdToStage(task.stageId)
val mapStage = shuffleIdToMapStage(shuffleId)
if (failedStage.latestInfo.attemptNumber != task.stageAttemptId) {
logInfo(s"Ignoring fetch failure from $task as it's from $failedStage attempt" +
s" ${task.stageAttemptId} and there is a more recent attempt for that stage " +
s"(attempt ${failedStage.latestInfo.attemptNumber}) running")
} else {
failedStage.failedAttemptIds.add(task.stageAttemptId)
val shouldAbortStage =
failedStage.failedAttemptIds.size >= maxConsecutiveStageAttempts ||
disallowStageRetryForTest
// It is likely that we receive multiple FetchFailed for a single stage (because we have
// multiple tasks running concurrently on different executors). In that case, it is
// possible the fetch failure has already been handled by the scheduler.
if (runningStages.contains(failedStage)) {
logInfo(s"Marking $failedStage (${failedStage.name}) as failed " +
s"due to a fetch failure from $mapStage (${mapStage.name})")
markStageAsFinished(failedStage, errorMessage = Some(failureMessage),
willRetry = !shouldAbortStage)
} else {
logDebug(s"Received fetch failure from $task, but it's from $failedStage which is no " +
"longer running")
}
if (mapStage.rdd.isBarrier()) {
// Mark all the map as broken in the map stage, to ensure retry all the tasks on
// resubmitted stage attempt.
mapOutputTracker.unregisterAllMapOutput(shuffleId)
} else if (mapIndex != -1) {
// Mark the map whose fetch failed as broken in the map stage
mapOutputTracker.unregisterMapOutput(shuffleId, mapIndex, bmAddress)
}
if (failedStage.rdd.isBarrier()) {
failedStage match {
case failedMapStage: ShuffleMapStage =>
// Mark all the map as broken in the map stage, to ensure retry all the tasks on
// resubmitted stage attempt.
mapOutputTracker.unregisterAllMapOutput(failedMapStage.shuffleDep.shuffleId)
case failedResultStage: ResultStage =>
// Abort the failed result stage since we may have committed output for some
// partitions.
val reason = "Could not recover from a failed barrier ResultStage. Most recent " +
s"failure reason: $failureMessage"
abortStage(failedResultStage, reason, None)
}
}
if (shouldAbortStage) {
val abortMessage = if (disallowStageRetryForTest) {
"Fetch failure will not retry stage due to testing config"
} else {
s"""$failedStage (${failedStage.name})
|has failed the maximum allowable number of
|times: $maxConsecutiveStageAttempts.
|Most recent failure reason: $failureMessage""".stripMargin.replaceAll("\n", " ")
}
abortStage(failedStage, abortMessage, None)
} else { // update failedStages and make sure a ResubmitFailedStages event is enqueued
// TODO: Cancel running tasks in the failed stage -- cf. SPARK-17064
val noResubmitEnqueued = !failedStages.contains(failedStage)
failedStages += failedStage
failedStages += mapStage
if (noResubmitEnqueued) {
if (mapStage.isIndeterminate) {
val stagesToRollback = HashSet[Stage](mapStage)
def collectStagesToRollback(stageChain: List[Stage]): Unit = {
if (stagesToRollback.contains(stageChain.head)) {
stageChain.drop(1).foreach(s => stagesToRollback += s)
} else {
stageChain.head.parents.foreach { s =>
collectStagesToRollback(s :: stageChain)
}
}
}
def generateErrorMessage(stage: Stage): String = {
"A shuffle map stage with indeterminate output was failed and retried. " +
s"However, Spark cannot rollback the $stage to re-process the input data, " +
"and has to fail this job. Please eliminate the indeterminacy by " +
"checkpointing the RDD before repartition and try again."
}
activeJobs.foreach(job => collectStagesToRollback(job.finalStage :: Nil))
// The stages will be rolled back after checking
val rollingBackStages = HashSet[Stage](mapStage)
stagesToRollback.foreach {
case mapStage: ShuffleMapStage =>
val numMissingPartitions = mapStage.findMissingPartitions().length
if (numMissingPartitions < mapStage.numTasks) {
if (sc.getConf.get(config.SHUFFLE_USE_OLD_FETCH_PROTOCOL)) {
val reason = "A shuffle map stage with indeterminate output was failed " +
"and retried. However, Spark can only do this while using the new " +
"shuffle block fetching protocol. Please check the config " +
"'spark.shuffle.useOldFetchProtocol', see more detail in " +
"SPARK-27665 and SPARK-25341."
abortStage(mapStage, reason, None)
} else {
rollingBackStages += mapStage
}
}
case resultStage: ResultStage if resultStage.activeJob.isDefined =>
val numMissingPartitions = resultStage.findMissingPartitions().length
if (numMissingPartitions < resultStage.numTasks) {
// TODO: support to rollback result tasks.
abortStage(resultStage, generateErrorMessage(resultStage), None)
}
case _ =>
}
logInfo(s"The shuffle map stage $mapStage with indeterminate output was failed, " +
s"we will roll back and rerun below stages which include itself and all its " +
s"indeterminate child stages: $rollingBackStages")
}
logInfo(
s"Resubmitting $mapStage (${mapStage.name}) and " +
s"$failedStage (${failedStage.name}) due to fetch failure"
)
messageScheduler.schedule(
new Runnable {
override def run(): Unit = eventProcessLoop.post(ResubmitFailedStages)
},
DAGScheduler.RESUBMIT_TIMEOUT,
TimeUnit.MILLISECONDS
)
}
}
// TODO: mark the executor as failed only if there were lots of fetch failures on it
if (bmAddress != null) {
val hostToUnregisterOutputs = if (env.blockManager.externalShuffleServiceEnabled &&
unRegisterOutputOnHostOnFetchFailure) {
// We had a fetch failure with the external shuffle service, so we
// assume all shuffle data on the node is bad.
Some(bmAddress.host)
} else {
// Unregister shuffle data just for one executor (we don't have any
// reason to believe shuffle data has been lost for the entire host).
None
}
removeExecutorAndUnregisterOutputs(
execId = bmAddress.executorId,
fileLost = true,
hostToUnregisterOutputs = hostToUnregisterOutputs,
maybeEpoch = Some(task.epoch))
}
}
- Resubmit all the failed stages:
private[scheduler] def resubmitFailedStages(): Unit = {
if (failedStages.nonEmpty) {
// Failed stages may be removed by job cancellation, so failed might be empty even if
// the ResubmitFailedStages event has been scheduled.
logInfo("Resubmitting failed stages")
clearCacheLocs()
val failedStagesCopy = failedStages.toArray
failedStages.clear()
for (stage <- failedStagesCopy.sortBy(_.firstJobId)) {
submitStage(stage)
}
}
}
- When recomputing, partitions that have already been computed successfully are skipped, so the shuffle read volume may shrink on retry; see DAGScheduler.submitMissingTasks():
val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
Retry counts
- For FetchFailed, a stage retry is triggered; once the stage retry limit is reached, the job fails. Other failure types do not trigger stage retries.
- Other task failures are retried at the task level; once the task retry limit is reached, the job fails.
- After the job fails, there is still application-level retry (see the config sketch below).
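As a hedged sketch (value is illustrative), on YARN the application-level retry mentioned above is governed by spark.yarn.maxAppAttempts, which is additionally capped by YARN's yarn.resourcemanager.am.max-attempts on the cluster side:

// Hedged sketch: whole-application retries on YARN; the value is an example only.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.yarn.maxAppAttempts", "2") // restart the whole application (new driver/AM) up to 2 times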
Error screenshot from production:
Impact of data locality on retries
Locality can cause every retry to be scheduled onto the same node, so several retries in a row fail for the same reason; this article has a detailed analysis.
How to mitigate the data-locality impact
Use the blacklist mechanism. For a single application, Spark provides three levels of blacklist that can be applied to an executor/node: task blacklist -> stage blacklist -> application blacklist. For the situation above, the following parameter can be used (see the config sketch after the table):
| Parameter | Meaning |
|---|---|
| spark.blacklist.task.maxTaskAttemptsPerNode | Failure threshold for the same task on a single node; once reached, that node is blacklisted for this task |
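A hedged sketch of enabling the blacklist mechanism (the keys are the standard spark.blacklist.* settings; the values are examples, not recommendations):

// Hedged sketch: blacklist settings so retries stop landing on the same bad node.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.blacklist.enabled", "true")
  .set("spark.blacklist.task.maxTaskAttemptsPerNode", "1")     // blacklist the node for this task after 1 failure there
  .set("spark.blacklist.task.maxTaskAttemptsPerExecutor", "1") // blacklist the executor for this task after 1 failure
  .set("spark.blacklist.stage.maxFailedTasksPerExecutor", "2") // blacklist the executor for the whole stage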
About INDETERMINATE stages
Because its output is not deterministic, an INDETERMINATE stage has to be recomputed in full; see:
An INDETERMINATE operator such as repartition() generates a random id, so each computation can produce a different result; see:
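A hedged sketch of the workaround suggested by the abort message above (checkpoint the RDD before repartition so the shuffle input becomes deterministic on retry); the paths below are hypothetical:

// Hedged sketch: checkpoint before repartition() so a recomputed downstream stage re-reads
// a stable snapshot instead of an indeterminately re-shuffled input. Paths are hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val sc = spark.sparkContext
sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // hypothetical checkpoint directory

val input = sc.textFile("hdfs:///data/input")    // hypothetical input path
input.checkpoint()                               // mark the RDD for checkpointing
input.count()                                    // action that materializes the checkpoint
val repartitioned = input.repartition(200)       // downstream stages now see deterministic input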
At the same time, github.com/apache/spar… changed the mapId to a monotonically increasing id, so the index and data files written by a retry no longer share the same names, which indirectly solves the duplicate-id problem raised in toutiao.io/posts/y4d6e… ; this change takes effect in 2.4.