Task Failure and Retry Logic


The Production Issue

Our production Spark clusters run entirely on machines that are past their warranty period, so individual machines fail regularly and jobs frequently go through failure-and-retry cycles. Below, we walk through the failure-retry flow and the retry logic from the source code.

The Flow

Referring to the map-task result delivery section of juejin.cn/post/719771…: after a ShuffleMapTask finishes, the executor sends a StatusUpdate to the driver, so the analysis starts from the TaskSchedulerImpl.statusUpdate method.

1. The failure result is handed to the TaskSetManager

TaskSchedulerImpl.statusUpdate()

if (Set(TaskState.FAILED, TaskState.KILLED, TaskState.LOST).contains(state)) {
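  // Hand the serialized failure off to TaskResultGetter's thread pool, which
  // deserializes the failure reason and then calls back into the scheduler.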
  taskResultGetter.enqueueFailedTask(taskSet, tid, state, serializedData)
}

TaskResultGetter.enqueueFailedTask()

scheduler.handleFailedTask(taskSetManager, tid, taskState, reason)

TaskSchedulerImpl.handleFailedTask()

taskSetManager.handleFailedTask(tid, taskState, reason)

2. TaskSetManager handles the failed task

  • For FetchFailed: isZombie = true, moving the TaskSetManager into the zombie state
case fetchFailed: FetchFailed =>
  logWarning(failureReason)
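  // Counterintuitively, mark the task as "successful": this zombie TaskSetManager
  // must not resubmit it, because the DAGScheduler will resubmit the whole stage.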
  if (!successful(index)) {
    successful(index) = true
    tasksSuccessful += 1
  }
  isZombie = true

  if (fetchFailed.bmAddress != null) {
    blacklistTracker.foreach(_.updateBlacklistForFetchFailure(
      fetchFailed.bmAddress.host, fetchFailed.bmAddress.executorId))
  }

  None
  • Send the task's end state to the DAGScheduler
sched.dagScheduler.taskEnded(tasks(index), reason, null, accumUpdates, metricPeaks, info)
  • Use the task's failure count to decide whether the stage has failed; the maximum failure count can be configured in the job-submission shell script (a configuration sketch follows the code below). If the blacklist feature is enabled, the failing node is added to the blacklist
if (!isZombie && reason.countTowardsTaskFailures) {
  assert (null != failureReason)
  taskSetBlacklistHelperOpt.foreach(_.updateBlacklistForFailedTask(
    info.host, info.executorId, index, failureReason))
  numFailures(index) += 1
  if (numFailures(index) >= maxTaskFailures) {
    logError("Task %d in stage %s failed %d times; aborting job".format(
      index, taskSet.id, maxTaskFailures))
    abort("Task %d in stage %s failed %d times, most recent failure: %s\nDriver stacktrace:"
      .format(index, taskSet.id, maxTaskFailures, failureReason), failureException)
    return
  }
}
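As a rough sketch of that configuration (the values here are illustrative, not recommendations): maxTaskFailures in the code above comes from spark.task.maxFailures, which can be set via --conf in the spark-submit script or programmatically:

import org.apache.spark.SparkConf

// Illustrative values; the default for spark.task.maxFailures is 4.
val conf = new SparkConf()
  .set("spark.task.maxFailures", "8")     // task-level retry limit (maxTaskFailures above)
  .set("spark.blacklist.enabled", "true") // enable the blacklist feature mentioned above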

3. The DAGScheduler handles the failed task

  1. If the failure count reaches spark.stage.maxConsecutiveAttempts, the stage fails and is not retried again.
  2. Add the failedStage and its corresponding mapStage to failedStages.
  3. For an INDETERMINATE stage, compute the stages that need to be rolled back, namely all downstream stages, and recompute them.
  4. Post a ResubmitFailedStages event.
case FetchFailed(bmAddress, shuffleId, _, mapIndex, _, failureMessage) =>
  val failedStage = stageIdToStage(task.stageId)
  val mapStage = shuffleIdToMapStage(shuffleId)

  if (failedStage.latestInfo.attemptNumber != task.stageAttemptId) {
    logInfo(s"Ignoring fetch failure from $task as it's from $failedStage attempt" +
      s" ${task.stageAttemptId} and there is a more recent attempt for that stage " +
      s"(attempt ${failedStage.latestInfo.attemptNumber}) running")
  } else {
    failedStage.failedAttemptIds.add(task.stageAttemptId)
    val shouldAbortStage =
      failedStage.failedAttemptIds.size >= maxConsecutiveStageAttempts ||
      disallowStageRetryForTest

    // It is likely that we receive multiple FetchFailed for a single stage (because we have
    // multiple tasks running concurrently on different executors). In that case, it is
    // possible the fetch failure has already been handled by the scheduler.
    if (runningStages.contains(failedStage)) {
      logInfo(s"Marking $failedStage (${failedStage.name}) as failed " +
        s"due to a fetch failure from $mapStage (${mapStage.name})")
      markStageAsFinished(failedStage, errorMessage = Some(failureMessage),
        willRetry = !shouldAbortStage)
    } else {
      logDebug(s"Received fetch failure from $task, but it's from $failedStage which is no " +
        "longer running")
    }

    if (mapStage.rdd.isBarrier()) {
      // Mark all the map as broken in the map stage, to ensure retry all the tasks on
      // resubmitted stage attempt.
      mapOutputTracker.unregisterAllMapOutput(shuffleId)
    } else if (mapIndex != -1) {
      // Mark the map whose fetch failed as broken in the map stage
      mapOutputTracker.unregisterMapOutput(shuffleId, mapIndex, bmAddress)
    }

    if (failedStage.rdd.isBarrier()) {
      failedStage match {
        case failedMapStage: ShuffleMapStage =>
          // Mark all the map as broken in the map stage, to ensure retry all the tasks on
          // resubmitted stage attempt.
          mapOutputTracker.unregisterAllMapOutput(failedMapStage.shuffleDep.shuffleId)

        case failedResultStage: ResultStage =>
          // Abort the failed result stage since we may have committed output for some
          // partitions.
          val reason = "Could not recover from a failed barrier ResultStage. Most recent " +
            s"failure reason: $failureMessage"
          abortStage(failedResultStage, reason, None)
      }
    }

    if (shouldAbortStage) {
      val abortMessage = if (disallowStageRetryForTest) {
        "Fetch failure will not retry stage due to testing config"
      } else {
        s"""$failedStage (${failedStage.name})
           |has failed the maximum allowable number of
           |times: $maxConsecutiveStageAttempts.
           |Most recent failure reason: $failureMessage""".stripMargin.replaceAll("\n", " ")
      }
      abortStage(failedStage, abortMessage, None)
    } else { // update failedStages and make sure a ResubmitFailedStages event is enqueued
      // TODO: Cancel running tasks in the failed stage -- cf. SPARK-17064
      val noResubmitEnqueued = !failedStages.contains(failedStage)
      failedStages += failedStage
      failedStages += mapStage
      if (noResubmitEnqueued) {
        if (mapStage.isIndeterminate) {
          val stagesToRollback = HashSet[Stage](mapStage)

          def collectStagesToRollback(stageChain: List[Stage]): Unit = {
            if (stagesToRollback.contains(stageChain.head)) {
              stageChain.drop(1).foreach(s => stagesToRollback += s)
            } else {
              stageChain.head.parents.foreach { s =>
                collectStagesToRollback(s :: stageChain)
              }
            }
          }

          def generateErrorMessage(stage: Stage): String = {
            "A shuffle map stage with indeterminate output was failed and retried. " +
              s"However, Spark cannot rollback the $stage to re-process the input data, " +
              "and has to fail this job. Please eliminate the indeterminacy by " +
              "checkpointing the RDD before repartition and try again."
          }

          activeJobs.foreach(job => collectStagesToRollback(job.finalStage :: Nil))

          // The stages will be rolled back after checking
          val rollingBackStages = HashSet[Stage](mapStage)
          stagesToRollback.foreach {
            case mapStage: ShuffleMapStage =>
              val numMissingPartitions = mapStage.findMissingPartitions().length
              if (numMissingPartitions < mapStage.numTasks) {
                if (sc.getConf.get(config.SHUFFLE_USE_OLD_FETCH_PROTOCOL)) {
                  val reason = "A shuffle map stage with indeterminate output was failed " +
                    "and retried. However, Spark can only do this while using the new " +
                    "shuffle block fetching protocol. Please check the config " +
                    "'spark.shuffle.useOldFetchProtocol', see more detail in " +
                    "SPARK-27665 and SPARK-25341."
                  abortStage(mapStage, reason, None)
                } else {
                  rollingBackStages += mapStage
                }
              }

            case resultStage: ResultStage if resultStage.activeJob.isDefined =>
              val numMissingPartitions = resultStage.findMissingPartitions().length
              if (numMissingPartitions < resultStage.numTasks) {
                // TODO: support to rollback result tasks.
                abortStage(resultStage, generateErrorMessage(resultStage), None)
              }

            case _ =>
          }
          logInfo(s"The shuffle map stage $mapStage with indeterminate output was failed, " +
            s"we will roll back and rerun below stages which include itself and all its " +
            s"indeterminate child stages: $rollingBackStages")
        }
        logInfo(
          s"Resubmitting $mapStage (${mapStage.name}) and " +
            s"$failedStage (${failedStage.name}) due to fetch failure"
        )
        messageScheduler.schedule(
          new Runnable {
            override def run(): Unit = eventProcessLoop.post(ResubmitFailedStages)
          },
          DAGScheduler.RESUBMIT_TIMEOUT,
          TimeUnit.MILLISECONDS
        )
      }
    }

    // TODO: mark the executor as failed only if there were lots of fetch failures on it
    if (bmAddress != null) {
      val hostToUnregisterOutputs = if (env.blockManager.externalShuffleServiceEnabled &&
        unRegisterOutputOnHostOnFetchFailure) {
        // We had a fetch failure with the external shuffle service, so we
        // assume all shuffle data on the node is bad.
        Some(bmAddress.host)
      } else {
        // Unregister shuffle data just for one executor (we don't have any
        // reason to believe shuffle data has been lost for the entire host).
        None
      }
      removeExecutorAndUnregisterOutputs(
        execId = bmAddress.executorId,
        fileLost = true,
        hostToUnregisterOutputs = hostToUnregisterOutputs,
        maybeEpoch = Some(task.epoch))
    }
  }
  5. Resubmit all the failed stages
private[scheduler] def resubmitFailedStages(): Unit = {
  if (failedStages.nonEmpty) {
    // Failed stages may be removed by job cancellation, so failedStages may be
    // empty even if the ResubmitFailedStages event has been scheduled.
    logInfo("Resubmitting failed stages")
    clearCacheLocs()
    val failedStagesCopy = failedStages.toArray
    failedStages.clear()
    for (stage <- failedStagesCopy.sortBy(_.firstJobId)) {
      submitStage(stage)
    }
  }
}
  6. When recomputing, partitions that have already completed successfully are not computed again, so the shuffle read volume may shrink. DAGScheduler.submitMissingTasks():
val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
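For a shuffle map stage, findMissingPartitions asks the MapOutputTrackerMaster which partitions still have no registered map output. A sketch of the idea, paraphrasing ShuffleMapStage in recent Spark versions:

// Paraphrased from ShuffleMapStage: only partitions without registered map
// output are recomputed; if the tracker knows nothing about this shuffle,
// every partition is considered missing.
override def findMissingPartitions(): Seq[Int] = {
  mapOutputTrackerMaster
    .findMissingPartitions(shuffleDep.shuffleId)
    .getOrElse(0 until numPartitions)
}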

Retry Counts

  1. A FetchFailed triggers a stage retry; once the stage retry limit is reached, the job fails. Other failure types do not trigger stage retries.
  2. Other task failures are retried at the task level; once the task retry limit is reached, the job fails.
  3. After the job fails, there is still an application-level retry. The knobs for all three levels are sketched below.
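A sketch of where each retry level is configured (these are standard Spark/YARN parameters; the values are illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Task level: how many times a single task may fail before the job aborts.
  .set("spark.task.maxFailures", "4")
  // Stage level: consecutive stage attempts allowed on fetch failure.
  .set("spark.stage.maxConsecutiveAttempts", "4")
  // Application level (on YARN): how many times the whole application may run.
  .set("spark.yarn.maxAppAttempts", "2")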

Screenshot of the production error: (image omitted)

How Locality Affects Retries

Locality can cause every retry to land on the same node, so several retries in a row all fail. This article has a detailed analysis:

toutiao.io/posts/y4d6e…

How to Avoid the Data-Locality Effect

Use the blacklist mechanism. For a single application, three levels of blacklist are available for an executor/node: task blacklist -> stage blacklist -> application blacklist. For the situation above, you can use:

| Parameter | Meaning |
| --- | --- |
| spark.blacklist.task.maxTaskAttemptsPerNode | Failure threshold for the same task on a single node. Once it is reached, that node is blacklisted when executing this task. |
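A sketch of enabling this (the parameter names are the standard Spark blacklist configs; the values are illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.blacklist.enabled", "true")
  // Task level: after one failure of a task on a node, blacklist that node
  // for this task, forcing the retry onto a different node.
  .set("spark.blacklist.task.maxTaskAttemptsPerNode", "1")
  // Stage level: thresholds for blacklisting an executor/node for one stage.
  .set("spark.blacklist.stage.maxFailedTasksPerExecutor", "2")
  .set("spark.blacklist.stage.maxFailedExecutorsPerNode", "2")
  // Application level: thresholds for blacklisting for the whole application.
  .set("spark.blacklist.application.maxFailedTasksPerExecutor", "2")
  .set("spark.blacklist.application.maxFailedExecutorsPerNode", "2")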

About INDETERMINATE Stages

Because its output is nondeterministic, an INDETERMINATE stage must be recomputed in full; see:

github.com/apache/spar…

An INDETERMINATE operator such as repartition() generates a random id, so each computation can produce a different result; see:

github.com/apache/spar…
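A minimal sketch of the workaround named in the abort message above ("checkpointing the RDD before repartition"); the paths and input here are illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("checkpoint-before-repartition").getOrCreate()
val sc = spark.sparkContext
sc.setCheckpointDir("hdfs:///tmp/checkpoints") // illustrative path

val rdd = sc.textFile("hdfs:///data/input").map(_.length) // illustrative input
rdd.checkpoint() // materialized at the first action that computes rdd
// After checkpointing, the repartition input is fixed, so a stage retry
// replays the checkpointed data instead of recomputing indeterminate output.
val repartitioned = rdd.repartition(200)
repartitioned.count() // action that also triggers the checkpoint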

At the same time, github.com/apache/spar… changed the map id to a monotonically increasing id, so the index and data files produced by a retry no longer collide with the original file names. This indirectly fixes the identical-id problem raised in toutiao.io/posts/y4d6e…; the change took effect in 2.4.