Spark Job and Task Scheduling


Overview of the Flow


  1. RDD transformations and DAG construction
  2. DAGScheduler:
    • splits the job into stages
    • creates and submits tasks
  3. TaskScheduler: schedules tasks according to pool priority, locality, etc.
  4. Task execution

DAGScheduler

(only the main methods are shown here)
handleJobSubmitted

  1. Create the finalStage and split the job into stages
finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
  2. Create the job
val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
  3. Recursively submit every stage that has not yet been computed
submitStage(finalStage)

submitStage

  1. Find all unavailable (i.e., not yet submitted) parent stages. Parents are located by a DFS over the RDD lineage that cuts at wide (shuffle) dependencies
val missing = getMissingParentStages(stage).sortBy(_.id)  

A stage counts as available when the number of partition tasks that have already produced output equals its number of partitions.
  2. If the stage has no unsubmitted parent stages, submit it directly

submitMissingTasks(stage, jobId.get)
  3. If there are unsubmitted parent stages, recurse on each of them
submitStage(parent)
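Outside Spark, the parent-stage search can be modeled as a depth-first walk over the lineage that stays inside a stage across narrow dependencies and cuts at wide (shuffle) dependencies. The RDD class below is a simplified stand-in for illustration, not Spark's API:

```python
# Minimal model of getMissingParentStages: walk the lineage depth-first,
# staying inside the current stage across narrow deps, cutting at wide deps.
class RDD:
    def __init__(self, name, deps=()):
        self.name = name
        self.deps = deps            # tuples of ("narrow" | "wide", parent RDD)

def parent_stage_roots(rdd):
    roots, visited, stack = [], set(), [rdd]
    while stack:
        r = stack.pop()
        if r.name in visited:
            continue
        visited.add(r.name)
        for kind, parent in r.deps:
            if kind == "wide":
                roots.append(parent.name)   # shuffle boundary: parent heads a new stage
            else:
                stack.append(parent)        # narrow dep: same stage, keep walking
    return sorted(roots)

# textFile -> map (narrow) -> reduceByKey (wide) -> map (narrow) -> final
src   = RDD("src")
m1    = RDD("m1", [("narrow", src)])
red   = RDD("red", [("wide", m1)])
final = RDD("final", [("narrow", red)])
print(parent_stage_roots(final))   # ['m1']
```

The walk from `final` passes through the narrow dependency into `red`, hits the wide dependency, and stops: `m1` becomes the root of a parent stage.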

submitMissingTasks

  1. Get the partitions that have not yet been computed
val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
  2. Compute task locality
    getPreferredLocsInternal distinguishes three cases:
    • if the RDD is cached, each partition's locations come from the cache's location info
    • if the RDD has preferred locations, each partition's locations come from preferredLocations
    • otherwise, walk all parent RDDs reachable through NarrowDependency and return the parents' preferred locations
val taskIdToLocations: Map[Int, Seq[TaskLocation]] =  
  stage match {
    case s: ShuffleMapStage =>
      partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
    case s: ResultStage =>
      partitionsToCompute.map { id =>
        val p = s.partitions(id)
        (id, getPreferredLocs(stage.rdd, p))
      }.toMap
  }
  3. Build the task's broadcast variable
    The serialized RDD is broadcast, and each task deserializes its own copy of it. This isolates tasks from one another, which is essential when the code is not thread-safe.
taskBinaryBytes = stage match {
  case stage: ShuffleMapStage =>
    JavaUtils.bufferToArray(
      closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
  case stage: ResultStage =>
    JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
}
taskBinary = sc.broadcast(taskBinaryBytes)
  4. Build the tasks
val tasks: Seq[Task[_]] = {
  val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
  stage match {
    case stage: ShuffleMapStage =>
      stage.pendingPartitions.clear()
      partitionsToCompute.map { id =>
        val locs = taskIdToLocations(id)
        val part = partitions(id)
        stage.pendingPartitions += id
        new ShuffleMapTask(stage.id, stage.latestInfo.attemptNumber,
          taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
          Option(sc.applicationId), sc.applicationAttemptId, stage.rdd.isBarrier())
      }

    case stage: ResultStage =>
      partitionsToCompute.map { id =>
        val p: Int = stage.partitions(id)
        val part = partitions(p)
        val locs = taskIdToLocations(id)
        new ResultTask(stage.id, stage.latestInfo.attemptNumber,
          taskBinary, part, locs, id, properties, serializedTaskMetrics,
          Option(jobId), Option(sc.applicationId), sc.applicationAttemptId,
          stage.rdd.isBarrier())
      }
  }
}
  5. Build the TaskSet and submit it
taskScheduler.submitTasks(new TaskSet(
  tasks.toArray, stage.id, stage.latestInfo.attemptNumber, jobId, properties))
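The three cases of getPreferredLocsInternal (cached locations first, then the RDD's own preferred locations, then the narrow parents' locations) can be sketched as a small recursion; the RDD model and cache_locs map below are hypothetical stand-ins, not Spark's API:

```python
class RDD:
    def __init__(self, name, deps=(), prefs=None):
        self.name = name
        self.deps = deps                 # tuples of ("narrow" | "wide", parent RDD)
        self.prefs = prefs or {}         # partition index -> [hosts]

def preferred_locs(rdd, partition, cache_locs):
    # 1) if the partition is cached, the cached block locations win
    cached = cache_locs.get((rdd.name, partition))
    if cached:
        return cached
    # 2) otherwise ask the RDD itself (e.g. HDFS block locations)
    own = rdd.prefs.get(partition)
    if own:
        return own
    # 3) otherwise recurse into narrow parents, taking the first non-empty answer
    for kind, parent in rdd.deps:
        if kind == "narrow":
            locs = preferred_locs(parent, partition, cache_locs)
            if locs:
                return locs
    return []

hdfs   = RDD("hdfs", prefs={0: ["host1"], 1: ["host2"]})
mapped = RDD("mapped", deps=[("narrow", hdfs)])
print(preferred_locs(mapped, 0, {}))                          # ['host1']
print(preferred_locs(mapped, 0, {("mapped", 0): ["host9"]}))  # ['host9']
```

A mapped RDD inherits its HDFS parent's block locations through the narrow dependency, unless the partition is already cached somewhere else.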

TaskSchedulerImpl

Initialization
The scheduling pool is initialized:

  def initialize(backend: SchedulerBackend): Unit = {
    this.backend = backend
    schedulableBuilder = {
      schedulingMode match {
        case SchedulingMode.FIFO =>
          new FIFOSchedulableBuilder(rootPool)
        case SchedulingMode.FAIR =>
          new FairSchedulableBuilder(rootPool, conf)
        case _ =>
          throw new IllegalArgumentException(s"Unsupported $SCHEDULER_MODE_PROPERTY: " +
          s"$schedulingMode")
      }
    }
    schedulableBuilder.buildPools()
  }

start() starts the CoarseGrainedSchedulerBackend. If speculative execution is enabled, it also sets up a periodic timer that checks for tasks that should be speculatively re-executed.

 override def start(): Unit = {
  backend.start()
  if (!isLocal && conf.get(SPECULATION_ENABLED)) {
    logInfo("Starting speculative execution thread")
    speculationScheduler.scheduleWithFixedDelay(
      () => Utils.tryOrStopSparkContext(sc) { checkSpeculatableTasks() },
      SPECULATION_INTERVAL_MS, SPECULATION_INTERVAL_MS, TimeUnit.MILLISECONDS)
  }
} 
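The periodic check eventually reaches TaskSetManager.checkSpeculatableTasks, which (roughly) waits until a quantile of the TaskSet has succeeded and then flags running tasks whose elapsed time exceeds a multiple of the median successful duration. A simplified model; the default-like thresholds (quantile 0.75, multiplier 1.5, 100 ms floor) are assumptions here:

```python
def speculatable_tasks(durations_done, running, now, num_tasks,
                       quantile=0.75, multiplier=1.5, min_time=100):
    """Simplified model of the speculation check.

    durations_done: durations (ms) of successfully finished tasks
    running:        dict task_id -> start time (ms) of still-running tasks
    Returns the ids of tasks worth launching a speculative copy for.
    """
    if len(durations_done) < quantile * num_tasks:
        return []                       # not enough finished tasks to judge yet
    median = sorted(durations_done)[len(durations_done) // 2]
    threshold = max(multiplier * median, min_time)
    return [tid for tid, start in running.items() if now - start > threshold]

# 8 of 10 tasks finished in ~100ms; "t1" has been running for 1000ms
print(speculatable_tasks([100] * 8, {"t1": 0, "t2": 900}, 1000, 10))  # ['t1']
```

Only "t1" exceeds the 150 ms threshold (1.5 × the 100 ms median), so only it gets a speculative copy.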

Key members

TaskSetManager: schedules the tasks of a single TaskSet, covering speculation, task locality, and per-task resource assignment.

  • decides whether to launch a task on a given executor, and which one
  • delays task scheduling when needed to achieve locality awareness
  • resubmits a failed task as long as it is within the allowed number of failures
  • handles straggler tasks

schedulableBuilder: schedules TaskSets. Spark offers two scheduling modes: FIFO (first in, first out) and FAIR. The default is FIFO: whichever job is submitted first runs first. FAIR additionally supports grouping TaskSets into pools with different weights; weights and current resource usage decide what runs next. The mode is set via spark.scheduler.mode.
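As a rough model: FIFO orders TaskSets by job priority (essentially the job id) with stage id as tie-breaker, while FAIR first favors pools running below their minShare, then the lower minShare-usage ratio, then the lower weighted running-task ratio. The sketch below mirrors that comparison logic with simplified dict-based schedulables (Spark's actual FairSchedulingAlgorithm also tie-breaks on name):

```python
def fifo_less_than(s1, s2):
    # earlier job (lower priority value) first; tie-break on stage id
    if s1["priority"] != s2["priority"]:
        return s1["priority"] < s2["priority"]
    return s1["stageId"] < s2["stageId"]

def fair_less_than(s1, s2):
    needy1 = s1["runningTasks"] < s1["minShare"]
    needy2 = s2["runningTasks"] < s2["minShare"]
    share1 = s1["runningTasks"] / max(s1["minShare"], 1)
    share2 = s2["runningTasks"] / max(s2["minShare"], 1)
    weight1 = s1["runningTasks"] / s1["weight"]
    weight2 = s2["runningTasks"] / s2["weight"]
    if needy1 and not needy2:
        return True                 # below minShare always wins
    if not needy1 and needy2:
        return False
    if needy1 and needy2:
        return share1 < share2      # both needy: smaller minShare usage first
    return weight1 < weight2        # neither needy: smaller weighted load first

a = {"priority": 1, "stageId": 3, "runningTasks": 1, "minShare": 2, "weight": 1}
b = {"priority": 2, "stageId": 1, "runningTasks": 5, "minShare": 2, "weight": 1}
print(fifo_less_than(a, b))   # True: job 1 before job 2
print(fair_less_than(a, b))   # True: a is under its minShare, b is not
```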


CoarseGrainedSchedulerBackend: tracks the available resources in the cluster and dispatches tasks to the executors.

submitTasks
At this point the following log line appears:

logInfo("Adding task set " + taskSet.id + " with " + tasks.length + " tasks")
  1. Build a TaskSetManager from the TaskSet and the maximum number of allowed task failures
val manager = createTaskSetManager(taskSet, maxTaskFailures)
  2. Add the TaskSetManager to the scheduler's pool
schedulableBuilder.addTaskSetManager(manager, manager.taskSet.properties)
  3. Allocate resources and schedule the tasks
backend.reviveOffers()

CoarseGrainedSchedulerBackend##makeOffers()
makeOffers presents the cluster's resources to the upper-layer TaskSchedulerImpl as WorkerOffers. TaskSchedulerImpl calls scheduler.resourceOffers to obtain the Seq[TaskDescription] to execute, which CoarseGrainedSchedulerBackend then dispatches to the executors.

  1. Collect each executor's available resources and create WorkerOffers
       val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
       val workOffers = activeExecutors.map {
         case (id, executorData) =>
           // the available resources on one executor (here only free cores)
           new WorkerOffer(id, executorData.executorHost, executorData.freeCores,
             Some(executorData.executorAddress.hostPort),
             executorData.resourcesInfo.map { case (rName, rInfo) =>
               (rName, rInfo.availableAddrs.toBuffer)
             })
       }.toIndexedSeq 
  2. Allocate resources, i.e., decide which tasks start on which executors
scheduler.resourceOffers(workOffers)       
  3. Launch the tasks
   if (taskDescs.nonEmpty) {
        launchTasks(taskDescs)
      }

resourceOffers

  1. Record the executor-to-host relationships
  2. Shuffle the offers randomly so that task assignment is load-balanced; tasks are then assigned by scanning the shuffled workers starting from index 0 and checking whether a task can launch on each worker
val shuffledOffers = shuffleOffers(filteredOffers)
  3. For each WorkerOffer, create a task-description buffer sized by its core count: List[workerId, ArrayBuffer[TaskDescription]]
val tasks = shuffledOffers.map(o => new ArrayBuffer[TaskDescription](o.cores / CPUS_PER_TASK))
  4. Collect all free CPUs
val availableCpus = shuffledOffers.map(o => o.cores).toArray
  5. Get the sorted TaskSets; if new executors have joined, recompute each TaskSet's locality levels
val sortedTaskSets = rootPool.getSortedTaskSetQueue
for (taskSet <- sortedTaskSets) {
  if (newExecAvail) {
    taskSet.executorAdded()
  }
}
  6. For each TaskSet, walk its locality levels from best to worst and, at each level, over all workers, to decide which tasks can launch on which workers.
    for (taskSet <- sortedTaskSets) {
        for (currentMaxLocality <- taskSet.myLocalityLevels) {
          var launchedTaskAtCurrentMaxLocality = false
          do {
            launchedTaskAtCurrentMaxLocality = resourceOfferSingleTaskSet(taskSet,
              currentMaxLocality, shuffledOffers, availableCpus,
              availableResources, tasks, addressesWithDescs)
            launchedAnyTask |= launchedTaskAtCurrentMaxLocality
          } while (launchedTaskAtCurrentMaxLocality)
        }
}

resourceOfferSingleTaskSet

  1. Iterate over each worker's available cores; if the available cores are at least what a task needs (CPUS_PER_TASK), go to step 2
  2. Call taskSet.resourceOffer(execId, host, maxLocality) to get a task that can launch on that executor; if the result is non-empty, append the task to tasks: Seq[ArrayBuffer[TaskDescription]], which records which tasks launch on which workers
  3. Decrement the available cores of any worker that was assigned a task in step 2 and update the related bookkeeping
private def resourceOfferSingleTaskSet(
    taskSet: TaskSetManager,
    maxLocality: TaskLocality,
    shuffledOffers: Seq[WorkerOffer],
    availableCpus: Array[Int],
    tasks: Seq[ArrayBuffer[TaskDescription]]) : Boolean = {
  var launchedTask = false

  //< build, for each worker, the sequence of tasks to run on it
  for (i <- 0 until shuffledOffers.size) {
    val execId = shuffledOffers(i).executorId
    val host = shuffledOffers(i).host
    if (availableCpus(i) >= CPUS_PER_TASK) {
      try {
        for (task <- taskSet.resourceOffer(execId, host, maxLocality)) {
          //< add the task chosen for the worker at index i to tasks(i); this records which tasks run on which worker
          tasks(i) += task

          availableCpus(i) -= CPUS_PER_TASK
          assert(availableCpus(i) >= 0)
          launchedTask = true
        }
      } catch {
        case e: TaskNotSerializableException =>
          return launchedTask
      }
    }
  }
  return launchedTask
}

TaskSetManager##resourceOffer

  1. Determine the worst locality level that can be tolerated
if (maxLocality != TaskLocality.NO_PREF) {
  allowedLocality = getAllowedLocalityLevel(curTime)
  if (allowedLocality > maxLocality) {
    // We're not allowed to search for farther-away tasks
    allowedLocality = maxLocality
  }
}
  2. Dequeue a task matching the locality, execId, and host; if a suitable task is found, update the bookkeeping, notify the DAGScheduler, and finally return a TaskDescription
    Log line:
logInfo(s"Starting $taskName (TID $taskId, $host, executor ${info.executorId}, " +
    s"partition ${task.partitionId}, $taskLocality, ${serializedTask.limit()} bytes)")
dequeueTask(execId, host, allowedLocality).map { case ((index, taskLocality, speculative)) =>
  // Found a task; do some bookkeeping and return a task description
  val task = tasks(index)
  val taskId = sched.newTaskId()
  // Do various bookkeeping
  copiesRunning(index) += 1
  val attemptNum = taskAttempts(index).size
  val info = new TaskInfo(taskId, index, attemptNum, curTime,
    execId, host, taskLocality, speculative)
  taskInfos(taskId) = info
  taskAttempts(index) = info :: taskAttempts(index)
  // Update our locality level for delay scheduling
  // NO_PREF will not affect the variables related to delay scheduling
  if (maxLocality != TaskLocality.NO_PREF) {
    currentLocalityIndex = getLocalityIndex(taskLocality)
    lastLaunchTime = curTime
  }
  // Serialize and return the task
  val serializedTask: ByteBuffer = try {
    ser.serialize(task)
  } catch {
    // If the task cannot be serialized, then there's no point to re-attempt the task,
    // as it will always fail. So just abort the whole task-set.
    case NonFatal(e) =>
      val msg = s"Failed to serialize task $taskId, not attempting to retry it."
      logError(msg, e)
      abort(s"$msg Exception during serialization: $e")
      throw new TaskNotSerializableException(e)
  }
  if (serializedTask.limit() > TaskSetManager.TASK_SIZE_TO_WARN_KIB * 1024 &&
    !emittedTaskSizeWarning) {
    emittedTaskSizeWarning = true
    logWarning(s"Stage ${task.stageId} contains a task of very large size " +
      s"(${serializedTask.limit() / 1024} KiB). The maximum recommended task size is " +
      s"${TaskSetManager.TASK_SIZE_TO_WARN_KIB} KiB.")
  }
  addRunningTask(taskId)

  // We used to log the time it takes to serialize the task, but task size is already
  // a good proxy to task serialization time.
  // val timeTaken = clock.getTime() - startTime
  val taskName = s"task ${info.id} in stage ${taskSet.id}"
  logInfo(s"Starting $taskName (TID $taskId, $host, executor ${info.executorId}, " +
    s"partition ${task.partitionId}, $taskLocality, ${serializedTask.limit()} bytes)")


  sched.dagScheduler.taskStarted(task, info)
  new TaskDescription(
    taskId,
    attemptNum,
    execId,
    taskName,
    index,
    task.partitionId,
    addedFiles,
    addedJars,
    task.localProperties,
    extraResources,
    serializedTask)
}

TaskSetManager##getAllowedLocalityLevel
tasksNeedToBeScheduledFrom: scans the given pending-task list from the tail and returns true as soon as it finds a task that still needs scheduling (its index has no running copy in copiesRunning and it has not succeeded); tasks that no longer need scheduling are lazily removed, and the scan continues from the new tail.
moreTasksToRunIn: for the pending-task lists of a locality level, drops entries whose tasks have all been scheduled or finished; returns true if any list still has a task waiting to run, false otherwise.

  private def getAllowedLocalityLevel(curTime: Long): TaskLocality.TaskLocality = {
    // Remove the scheduled or finished tasks lazily
    def tasksNeedToBeScheduledFrom(pendingTaskIds: ArrayBuffer[Int]): Boolean = {
      var indexOffset = pendingTaskIds.size
      while (indexOffset > 0) {
        indexOffset -= 1
        val index = pendingTaskIds(indexOffset)
        //copiesRunning(index) becomes 1 once the task has been scheduled; it is reset if the task fails
        if (copiesRunning(index) == 0 && !successful(index)) {
          return true
        } else {
          pendingTaskIds.remove(indexOffset)
        }
      }
      false
    }
    // Walk through the list of tasks that can be scheduled at each location and returns true
    // if there are any tasks that still need to be scheduled. Lazily cleans up tasks that have
    // already been scheduled.
    //walk the task lists: return true if any task is still unscheduled; lazily remove tasks that have been scheduled
    def moreTasksToRunIn(pendingTasks: HashMap[String, ArrayBuffer[Int]]): Boolean = {
      val emptyKeys = new ArrayBuffer[String]
      val hasTasks = pendingTasks.exists {
        case (id: String, tasks: ArrayBuffer[Int]) =>
          if (tasksNeedToBeScheduledFrom(tasks)) {
            true
          } else {
            emptyKeys += id
            false
          }
      }
      // The key could be executorId, host or rackId
      emptyKeys.foreach(id => pendingTasks.remove(id))
      hasTasks
    }

    while (currentLocalityIndex < myLocalityLevels.length - 1) {
      val moreTasks = myLocalityLevels(currentLocalityIndex) match {
        case TaskLocality.PROCESS_LOCAL => moreTasksToRunIn(pendingTasks.forExecutor)
        case TaskLocality.NODE_LOCAL => moreTasksToRunIn(pendingTasks.forHost)
        case TaskLocality.NO_PREF => pendingTasks.noPrefs.nonEmpty
        case TaskLocality.RACK_LOCAL => moreTasksToRunIn(pendingTasks.forRack)
      }
      if (!moreTasks) {
        // This is a performance optimization: if there are no more tasks that can
        // be scheduled at a particular locality level, there is no point in waiting
        // for the locality wait timeout (SPARK-4939).
        lastLaunchTime = curTime
        logDebug(s"No tasks for locality level ${myLocalityLevels(currentLocalityIndex)}, " +
          s"so moving to locality level ${myLocalityLevels(currentLocalityIndex + 1)}")
        currentLocalityIndex += 1
      } else if (curTime - lastLaunchTime >= localityWaits(currentLocalityIndex)) {
        // Jump to the next locality level, and reset lastLaunchTime so that the next locality
        // wait timer doesn't immediately expire
        lastLaunchTime += localityWaits(currentLocalityIndex)
        logDebug(s"Moving to ${myLocalityLevels(currentLocalityIndex + 1)} after waiting for " +
          s"${localityWaits(currentLocalityIndex)}ms")
        currentLocalityIndex += 1
      } else {
        return myLocalityLevels(currentLocalityIndex)
      }
    }
    myLocalityLevels(currentLocalityIndex)
  }

The loop body does the following:

  1. Check whether the pending-task set for the locality level myLocalityLevels(currentLocalityIndex) still contains tasks to execute.
  2. If not, drop one locality level and continue the loop.
  3. If so, and the time since the last launch is smaller than the wait configured for the current level (via spark.locality.wait.process, spark.locality.wait.node, or spark.locality.wait.rack), leave currentLocalityIndex unchanged and return myLocalityLevels(currentLocalityIndex). This is the key to delay scheduling: as long as the elapsed time since a task was last launched at some locality level is below the configured wait, the next task is launched at that same level, whether or not the previous offer actually launched a task.
  4. If so, but the time since getAllowedLocalityLevel last returned myLocalityLevels(currentLocalityIndex) exceeds the wait for the current level, drop one locality level and continue the loop.
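The level-advancing loop above condenses to a small state machine; has_pending stands in for moreTasksToRunIn, and waits for the spark.locality.wait.* values:

```python
def allowed_locality(levels, waits, state, now, has_pending):
    """Condensed model of getAllowedLocalityLevel.

    levels: e.g. ["PROCESS_LOCAL", "NODE_LOCAL", "ANY"]
    waits:  per-level wait in ms (models spark.locality.wait.*)
    state:  dict with "index" (currentLocalityIndex) and "last" (lastLaunchTime)
    """
    while state["index"] < len(levels) - 1:
        if not has_pending(levels[state["index"]]):
            state["last"] = now                     # nothing to run here: skip level
            state["index"] += 1
        elif now - state["last"] >= waits[state["index"]]:
            state["last"] += waits[state["index"]]  # waited long enough: degrade
            state["index"] += 1
        else:
            return levels[state["index"]]           # keep insisting on this level
    return levels[state["index"]]

state = {"index": 0, "last": 0}
pending = {"PROCESS_LOCAL": True, "NODE_LOCAL": True, "ANY": True}
levels, waits = ["PROCESS_LOCAL", "NODE_LOCAL", "ANY"], [3000, 3000, 0]
# at t=1000ms, with a 3000ms wait, we still insist on PROCESS_LOCAL
print(allowed_locality(levels, waits, state, 1000, pending.get))  # PROCESS_LOCAL
# at t=3500ms the wait has expired, so we degrade to NODE_LOCAL
print(allowed_locality(levels, waits, state, 3500, pending.get))  # NODE_LOCAL
```

Note how the degrade branch advances lastLaunchTime by exactly one wait interval rather than resetting it to now, so the next level's wait timer does not immediately expire.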

Spark's Delay Scheduling Strategy

    for (taskSet <- sortedTaskSets) {
        for (currentMaxLocality <- taskSet.myLocalityLevels) {
          var launchedTaskAtCurrentMaxLocality = false
          do {
            launchedTaskAtCurrentMaxLocality = resourceOfferSingleTaskSet(taskSet,
              currentMaxLocality, shuffledOffers, availableCpus,
              availableResources, tasks, addressesWithDescs)
            launchedAnyTask |= launchedTaskAtCurrentMaxLocality
          } while (launchedTaskAtCurrentMaxLocality)
        }
}
  1. Suppose a job runs in YARN mode with 10 executors [exec1-exec10] on host1-host10, reading HDFS data with no caching; the locality levels are then essentially NODE_LOCAL, RACK_LOCAL, and ANY. Say that while scheduling at the NODE_LOCAL level the last task was assigned to exec2 at time time1, and pendingTasksForHost no longer holds (non-empty) task lists for host1-host10; the remaining 8 executors then receive no task. Because launchedTaskAtCurrentMaxLocality is true, the executor list is traversed a second time; if that pass assigns nothing either, scheduling moves on to the RACK_LOCAL level.
if (!moreTasks) {
        // This is a performance optimization: if there are no more tasks that can
        // be scheduled at a particular locality level, there is no point in waiting
        // for the locality wait timeout (SPARK-4939).
        lastLaunchTime = curTime
        logDebug(s"No tasks for locality level ${myLocalityLevels(currentLocalityIndex)}, " +
          s"so moving to locality level ${myLocalityLevels(currentLocalityIndex + 1)}")
        currentLocalityIndex += 1
      } else if (curTime - lastLaunchTime >= localityWaits(currentLocalityIndex)) {
        // Jump to the next locality level, and reset lastLaunchTime so that the next locality
        // wait timer doesn't immediately expire
        lastLaunchTime += localityWaits(currentLocalityIndex)
        logDebug(s"Moving to ${myLocalityLevels(currentLocalityIndex + 1)} after waiting for " +
          s"${localityWaits(currentLocalityIndex)}ms")
        currentLocalityIndex += 1
      } else {
        return myLocalityLevels(currentLocalityIndex)
      }
}      
  2. Since HDFS blocks are spread across many hosts, some task's data may live on host11-host13, so pendingTasksForHost may still contain unscheduled tasks for those hosts (tasks not yet in copiesRunning). Spark then considers NODE_LOCAL to still have pending work, so the TaskSetManager's allowedLocality remains NODE_LOCAL and task selection keeps drawing from pendingTasksForHost rather than pendingTasksForRack; as a result, none of the executors get a task. But if, by the time exec8 is offered, currTime - time1 > the NODE_LOCAL wait time, the TaskSetManager's allowedLocality drops to RACK_LOCAL, exec8 is assigned a rack-local task, and the subsequent executors are served at RACK_LOCAL as well.

CoarseGrainedSchedulerBackend

if (TaskState.isFinished(state)) {
  executorDataMap.get(executorId) match {
    case Some(executorInfo) =>
      executorInfo.freeCores += scheduler.CPUS_PER_TASK
      resources.foreach { case (k, v) =>
        executorInfo.resourcesInfo.get(k).foreach { r =>
          r.release(v.addresses)
        }
      }
      makeOffers(executorId)
    case None =>
      // Ignoring the update since we don't know about the executor.
      logWarning(s"Ignored task status update ($taskId state $state) " +
        s"from unknown executor with ID $executorId")
  }
}
private def makeOffers(executorId: String): Unit = {
  // Make sure no executor is killed while some task is launching on it
  val taskDescs = withLock {
    // Filter out executors under killing
    if (executorIsAlive(executorId)) {
      val executorData = executorDataMap(executorId)
      val workOffers = IndexedSeq(
        new WorkerOffer(executorId, executorData.executorHost, executorData.freeCores,
          Some(executorData.executorAddress.hostPort),
          executorData.resourcesInfo.map { case (rName, rInfo) =>
            (rName, rInfo.availableAddrs.toBuffer)
          }))
      scheduler.resourceOffers(workOffers)
    } else {
      Seq.empty
    }
  }
  if (taskDescs.nonEmpty) {
    launchTasks(taskDescs)
  }
}
  3. If cores really are fewer than the number of tasks, then whenever an executor finishes a task a resource offer is made for that executor alone, so the node-local task can be assigned to it. Once pendingTasksForHost has no tasks left to schedule, the TaskSetManager drops to RACK_LOCAL and the other executors can receive RACK_LOCAL tasks too. The wait is therefore the maximum time a locality level is allowed to wait for an assignment; once it is exceeded, the TaskSetManager automatically drops to the next level and schedules tasks at that level.

References

cloud.tencent.com/developer/a…
zhuanlan.zhihu.com/p/541505732

Launching Tasks

CoarseGrainedSchedulerBackend##launchTasks
Allocates resources on the chosen executor and sends the serialized task to it

for (task <- tasks.flatten) {
  val serializedTask = TaskDescription.encode(task)
  if (serializedTask.limit() >= maxRpcMessageSize) {
    Option(scheduler.taskIdToTaskSetManager.get(task.taskId)).foreach { taskSetMgr =>
      try {
        var msg = "Serialized task %s:%d was %d bytes, which exceeds max allowed: " +
          s"${RPC_MESSAGE_MAX_SIZE.key} (%d bytes). Consider increasing " +
          s"${RPC_MESSAGE_MAX_SIZE.key} or using broadcast variables for large values."
        msg = msg.format(task.taskId, task.index, serializedTask.limit(), maxRpcMessageSize)
        taskSetMgr.abort(msg)
      } catch {
        case e: Exception => logError("Exception in error callback", e)
      }
    }
  }
  else {
    val executorData = executorDataMap(task.executorId)
    // Do resources allocation here. The allocated resources will get released after the task
    // finishes.
    executorData.freeCores -= scheduler.CPUS_PER_TASK
    task.resources.foreach { case (rName, rInfo) =>
      assert(executorData.resourcesInfo.contains(rName))
      executorData.resourcesInfo(rName).acquire(rInfo.addresses)
    }

    logDebug(s"Launching task ${task.taskId} on executor id: ${task.executorId} hostname: " +
      s"${executorData.executorHost}.")

    executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
  }
}

CoarseGrainedExecutorBackend##receive
Handles the LaunchTask message sent by the driver

case LaunchTask(data) =>
  if (executor == null) {
    exitExecutor(1, "Received LaunchTask command but executor was null")
  } else {
    val taskDesc = TaskDescription.decode(data.value)
    logInfo("Got assigned task " + taskDesc.taskId)
    taskResources(taskDesc.taskId) = taskDesc.resources
    executor.launchTask(this, taskDesc)
  }

Executor##launchTask
Wraps the TaskDescription in a TaskRunner and submits it to the executor's thread pool, which eventually invokes TaskRunner.run

def launchTask(context: ExecutorBackend, taskDescription: TaskDescription): Unit = {
  val tr = new TaskRunner(context, taskDescription)
  runningTasks.put(taskDescription.taskId, tr)
  threadPool.execute(tr)
}

TaskRunner##run logs:

logInfo(s"Running $taskName (TID $taskId)")
  1. Update the task state to RUNNING
execBackend.statusUpdate(taskId, TaskState.RUNNING, EMPTY_BYTE_BUFFER)
  2. Restore the task
    • update dependency files and jars
    • deserialize the serialized serializedTask
updateDependencies(taskDescription.addedFiles, taskDescription.addedJars)
task = ser.deserialize[Task[Any]](
  taskDescription.serializedTask, Thread.currentThread.getContextClassLoader)

  3. Run the task by calling Task.run
task.run(
  taskAttemptId = taskId,
  attemptNumber = taskDescription.attemptNumber,
  metricsSystem = env.metricsSystem,
  resources = taskDescription.resources)

Task##run
Ultimately calls the task's runTask method, implemented by ResultTask and ShuffleMapTask; from here the shuffle takes over.

runTask(context)