WeChat public account: 王了个博
Focused on big data, artificial intelligence, and programming languages.
I write code as well as prose. Feel free to share and follow.
RDD task splitting breaks down into four levels: Application, Job, Stage, and Task.
(1) Application: initializing a SparkContext creates one Application;
(2) Job: each action operator triggers one Job;
(3) Stage: the number of stages equals the number of wide (shuffle) dependencies plus 1;
(4) Task: within a stage, the number of tasks equals the partition count of the stage's last RDD.
Note: each level of Application -> Job -> Stage -> Task is a 1-to-n relationship.
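The counting rules above can be expressed as a toy model. All names below are illustrative stand-ins, not Spark's real scheduler classes:

```scala
// Toy model of the counting rules; Stage/Job here are made-up placeholders.
case class Stage(lastRddPartitions: Int)
case class Job(wideDependencies: Int, stages: List[Stage])

// Rule (3): number of stages = number of wide (shuffle) dependencies + 1
def numStages(job: Job): Int = job.wideDependencies + 1

// Rule (4): each stage contributes as many tasks as its last RDD has partitions
def numTasks(job: Job): Int = job.stages.map(_.lastRddPartitions).sum

// The WordCount job analyzed below: one shuffle (reduceByKey),
// two stages whose last RDDs each have 2 partitions.
val wordCount = Job(wideDependencies = 1, stages = List(Stage(2), Stage(2)))
```

Plugging in the WordCount example gives 2 stages and 4 tasks, matching the step-by-step analysis below.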
Main steps
Code example: the main program
```scala
// Code example
def main(args: Array[String]): Unit = {
  // 1. Create a SparkConf and set the app name
  val conf: SparkConf = new SparkConf().setAppName("SparkCoreTest").setMaster("local[*]")
  // 2. Create a SparkContext, the entry point for submitting a Spark app
  val sc: SparkContext = new SparkContext(conf)
  val rdd: RDD[String] = sc.textFile("input/1.txt")
  val mapRdd = rdd.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
  mapRdd.saveAsTextFile("outpath")
  // 3. Close the connection
  sc.stop()
}
```

Execution flow (Yarn-Cluster)
Now let's walk through it step by step.
1. Step one
Run the main method
Initialize the SparkContext (sc)
Execute up to an action operator
This phase only builds the lineage (dependency relationships); the actual data processing has not started yet.
2. Step two: the DAGScheduler splits the job above into stages, and the stages produce tasks
DAGScheduler: first divide the stages, then divide the tasks within each stage
At this point, number of stages in the job = number of wide dependencies + 1 = 2 (there is one wide dependency here, i.e. the place where the shuffle happens)
Number of tasks in the job: within each stage, the task count is the partition count of the stage's last RDD (2 + 2 = 4)
The ShuffleMapStage before the shuffle yields two tasks, and the ResultStage after the shuffle yields two.
3. Step three: the TaskScheduler gathers all of the job's tasks via a TaskSet, then serializes them and distributes them to the Executors
Number of jobs = number of action operators (just one action here) = 1
Source code analysis
Tracing step by step from the collect() method eventually leads to this key piece of code:
```scala
var finalStage: ResultStage = null
try {
  // New stage creation may throw an exception if, for example, jobs are run on a
  // HadoopRDD whose underlying HDFS files have been deleted.
  finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
} catch {
  case e: Exception =>
    logWarning("Creating new stage failed due to exception - job: " + jobId, e)
    listener.jobFailed(e)
    return
}
```

The key line is `finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)`.
Following the flow in the diagram above, the program locates the last RDD and creates the ResultStage from it.
Creating the ResultStage
```scala
private def createResultStage(
    rdd: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    jobId: Int,
    callSite: CallSite): ResultStage = {
  val parents = getOrCreateParentStages(rdd, jobId)
  val id = nextStageId.getAndIncrement()
  val stage = new ResultStage(id, rdd, func, partitions, parents, jobId, callSite)
  stageIdToStage(id) = stage
  updateJobIdStageIdMaps(jobId, stage)
  stage
}
```
The line `val parents = getOrCreateParentStages(rdd, jobId)` is the key: createResultStage creates a ResultStage and, at the same time, finds that stage's parents, i.e. its lineage dependencies.
3. getOrCreateParentStages (lineage dependencies)
```scala
private def getOrCreateParentStages(rdd: RDD[_], firstJobId: Int): List[Stage] = {
  getShuffleDependencies(rdd).map { shuffleDep =>
    getOrCreateShuffleMapStage(shuffleDep, firstJobId)
  }.toList
}
```

```scala
private[scheduler] def getShuffleDependencies(
    rdd: RDD[_]): HashSet[ShuffleDependency[_, _, _]] = {
  val parents = new HashSet[ShuffleDependency[_, _, _]]
  val visited = new HashSet[RDD[_]]
  val waitingForVisit = new Stack[RDD[_]]
  waitingForVisit.push(rdd)
  while (waitingForVisit.nonEmpty) {
    val toVisit = waitingForVisit.pop()
    if (!visited(toVisit)) {
      visited += toVisit
      toVisit.dependencies.foreach {
        case shuffleDep: ShuffleDependency[_, _, _] =>
          parents += shuffleDep
        case dependency =>
          waitingForVisit.push(dependency.rdd)
      }
    }
  }
  parents
}
```

Note: suppose A, B, C, and D are all linked by shuffle dependencies; getShuffleDependencies(D) returns only B and C, the *nearest* shuffle dependencies. The returned B and C are then each traversed, and a corresponding stage is created for each via getOrCreateShuffleMapStage.
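The stack-based walk in getShuffleDependencies can be mimicked with a small self-contained sketch. `Node`, `Narrow`, and `Shuffle` below are made-up stand-ins for RDD and its dependency types:

```scala
import scala.collection.mutable

// Toy dependency graph; names are illustrative, not Spark's classes.
sealed trait Dep { def parent: Node }
case class Narrow(parent: Node) extends Dep
case class Shuffle(parent: Node) extends Dep
case class Node(name: String, deps: List[Dep])

// Returns only the *nearest* shuffle parents, exactly like the source above:
// the walk records a shuffle edge but never traverses past it.
def shuffleDeps(root: Node): Set[String] = {
  val parents = mutable.Set.empty[String]
  val visited = mutable.Set.empty[String]
  val stack = mutable.Stack(root)
  while (stack.nonEmpty) {
    val n = stack.pop()
    if (!visited(n.name)) {
      visited += n.name
      n.deps.foreach {
        case Shuffle(p) => parents += p.name // record, do not look behind it
        case Narrow(p)  => stack.push(p)     // keep walking through narrow deps
      }
    }
  }
  parents.toSet
}

// The A/B/C/D example from the text: D <-shuffle- B <-shuffle- A, and likewise via C.
val a = Node("A", Nil)
val b = Node("B", List(Shuffle(a)))
val c = Node("C", List(Shuffle(a)))
val d = Node("D", List(Shuffle(b), Shuffle(c)))
```

Calling `shuffleDeps(d)` yields only B and C, never A, matching the note above.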
4. getOrCreateShuffleMapStage
```scala
private def getOrCreateShuffleMapStage(
    shuffleDep: ShuffleDependency[_, _, _],
    firstJobId: Int): ShuffleMapStage = {
  shuffleIdToMapStage.get(shuffleDep.shuffleId) match {
    case Some(stage) =>
      stage

    case None =>
      // Create stages for all missing ancestor shuffle dependencies.
      getMissingAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>
        if (!shuffleIdToMapStage.contains(dep.shuffleId)) {
          createShuffleMapStage(dep, firstJobId)
        }
      }
      // Finally, create a stage for the given shuffle dependency.
      createShuffleMapStage(shuffleDep, firstJobId)
  }
}
```

For a ShuffleMapStage that does not exist yet, createShuffleMapStage is called to create it.
5. Creating the ShuffleMapStage
```scala
def createShuffleMapStage(shuffleDep: ShuffleDependency[_, _, _], jobId: Int): ShuffleMapStage = {
  val rdd = shuffleDep.rdd
  val numTasks = rdd.partitions.length
  val parents = getOrCreateParentStages(rdd, jobId)
  val id = nextStageId.getAndIncrement()
  val stage = new ShuffleMapStage(id, rdd, numTasks, parents, jobId, rdd.creationSite, shuffleDep)
  // ... (rest of the method omitted)
}
```

The last line shown creates the ShuffleMapStage; what remains is submitting the stage.
With that, both the ResultStage and the ShuffleMapStage have been created (the process is reflected in the diagram).
6. The handleJobSubmitted() code
```scala
private[scheduler] def handleJobSubmitted(jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    callSite: CallSite,
    listener: JobListener,
    properties: Properties) {
  var finalStage: ResultStage = null
  try {
    // New stage creation may throw an exception if, for example, jobs are run on a
    // HadoopRDD whose underlying HDFS files have been deleted.
    finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
  } catch {
    case e: Exception =>
      logWarning("Creating new stage failed due to exception - job: " + jobId, e)
      listener.jobFailed(e)
      return
  }

  val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
  clearCacheLocs()
  logInfo("Got job %s (%s) with %d output partitions".format(
    job.jobId, callSite.shortForm, partitions.length))
  logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
  logInfo("Parents of final stage: " + finalStage.parents)
  logInfo("Missing parents: " + getMissingParentStages(finalStage))

  val jobSubmissionTime = clock.getTimeMillis()
  jobIdToActiveJob(jobId) = job
  activeJobs += job
  finalStage.setActiveJob(job)
  val stageIds = jobIdToStageIds(jobId).toArray
  val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
  listenerBus.post(
    SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
  submitStage(finalStage)
}
```

The key lines are `val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)` and `finalStage.setActiveJob(job)`: once the finalStage (the ResultStage from the source analysis above) has been found, the final stage is passed in and tied to the job.
7. submitStage(finalStage)
```scala
private def submitStage(stage: Stage) {
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    logDebug("submitStage(" + stage + ")")
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      val missing = getMissingParentStages(stage).sortBy(_.id)
      // ...
```

The finalStage (the ResultStage) is handed to getMissingParentStages, whose main purpose is to find the stages that come before it.
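The submit-parents-first behavior can be sketched with a simplified synchronous model. `SimpleStage` and the retry recursion are illustrative only; real Spark parks a stage in `waitingStages` and re-submits it asynchronously when its parents finish:

```scala
import scala.collection.mutable

// Toy model: a stage can run only after all of its parent stages have run.
case class SimpleStage(id: Int, parents: List[SimpleStage])

val submitted = mutable.ListBuffer.empty[Int]

def submitStage(stage: SimpleStage): Unit = {
  // Like getMissingParentStages: find parents that have not produced output yet.
  val missing = stage.parents.filterNot(p => submitted.contains(p.id)).sortBy(_.id)
  if (missing.isEmpty) {
    submitted += stage.id          // no missing parents: run this stage's tasks
  } else {
    missing.foreach(submitStage)   // submit parents first...
    submitStage(stage)             // ...then retry this stage (synchronous stand-in)
  }
}

// Two upstream ShuffleMapStage-like stages feeding a final ResultStage-like stage.
val s0 = SimpleStage(0, Nil)
val s1 = SimpleStage(1, Nil)
val result = SimpleStage(2, List(s0, s1))
```

Submitting `result` runs the upstream stages 0 and 1 before stage 2, which is exactly the ordering the real submitStage guarantees.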
8. getMissingParentStages()
```scala
private def getMissingParentStages(stage: Stage): List[Stage] = {
  val missing = new HashSet[Stage]
  val visited = new HashSet[RDD[_]]
  // We are manually maintaining a stack here to prevent StackOverflowError
  // caused by recursively visiting
  val waitingForVisit = new Stack[RDD[_]]
  def visit(rdd: RDD[_]) {
    if (!visited(rdd)) {
      visited += rdd
      val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
      if (rddHasUncachedPartitions) {
        for (dep <- rdd.dependencies) {
          dep match {
            case shufDep: ShuffleDependency[_, _, _] =>
              val mapStage = getOrCreateShuffleMapStage(shufDep, stage.firstJobId)
              if (!mapStage.isAvailable) {
                missing += mapStage
              }
            case narrowDep: NarrowDependency[_] =>
              waitingForVisit.push(narrowDep.rdd)
          }
        }
      }
    }
  }
  waitingForVisit.push(stage.rdd)
  while (waitingForVisit.nonEmpty) {
    visit(waitingForVisit.pop())
  }
  missing.toList
}
```

Look mainly at `def visit(rdd: RDD[_])` and `for (dep <- rdd.dependencies)`: the walk keeps looking for ShuffleDependency instances until none remain, adding each unavailable ShuffleMapStage to missing (this is how the number of shuffles is counted). Then submitMissingTasks runs, and during execution it determines how many tasks there are.
9. submitMissingTasks()
```scala
private def submitMissingTasks(stage: Stage, jobId: Int) {
  val tasks: Seq[Task[_]] = try {
    stage match {
      case stage: ShuffleMapStage =>
        partitionsToCompute.map { id =>
          val locs = taskIdToLocations(id)
          val part = stage.rdd.partitions(id)
          new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, stage.latestInfo.taskMetrics, properties, Option(jobId),
            Option(sc.applicationId), sc.applicationAttemptId)
        }

      case stage: ResultStage =>
        partitionsToCompute.map { id =>
          val p: Int = stage.partitions(id)
          val part = stage.rdd.partitions(p)
          val locs = taskIdToLocations(id)
          new ResultTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, id, properties, stage.latestInfo.taskMetrics,
            Option(jobId), Option(sc.applicationId), sc.applicationAttemptId)
        }
    }
}
```

```scala
override def findMissingPartitions(): Seq[Int] = {
  val missing = (0 until numPartitions).filter(id => outputLocs(id).isEmpty)
  assert(missing.size == numPartitions - _numAvailableOutputs,
    s"${missing.size} missing, expected ${numPartitions - _numAvailableOutputs}")
  missing
}
```

If the ShuffleMapStage's last RDD has two partitions, missing returns 0 and 1.
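The core of findMissingPartitions reduces to a filter over plain collections. In this sketch, `outputLocs` is an ordinary `Seq[Seq[String]]` standing in for the stage's registered map-output locations:

```scala
// A partition is "missing" when no map output has been registered for it yet.
def findMissingPartitions(outputLocs: Seq[Seq[String]]): Seq[Int] =
  outputLocs.indices.filter(id => outputLocs(id).isEmpty)

// A freshly created 2-partition ShuffleMapStage: no output yet, so both
// partitions 0 and 1 are missing, matching the 0/1 example in the text.
val fresh = findMissingPartitions(Seq(Seq.empty, Seq.empty))
```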
10. partitionsToCompute()
```scala
partitionsToCompute.map { id =>
  val locs = taskIdToLocations(id)
  val part = stage.rdd.partitions(id)
  new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
    taskBinary, part, locs, stage.latestInfo.taskMetrics, properties, Option(jobId),
    Option(sc.applicationId), sc.applicationAttemptId)
}
```

With two partitions, two ShuffleMapTasks are created with `new`, i.e. two tasks.
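The one-task-per-partition mapping can be sketched in isolation. `SketchTask` is a made-up stand-in for ShuffleMapTask:

```scala
// Toy stand-in for ShuffleMapTask: one task per partition to compute.
case class SketchTask(stageId: Int, partitionId: Int)

def makeTasks(stageId: Int, partitionsToCompute: Seq[Int]): Seq[SketchTask] =
  partitionsToCompute.map(id => SketchTask(stageId, id))

// A 2-partition stage therefore yields exactly two tasks.
val tasks = makeTasks(stageId = 0, partitionsToCompute = Seq(0, 1))
```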
The ResultStage branch works the same way, so it is not repeated here.
11. Later in the same submitMissingTasks() from step 9
```scala
if (tasks.size > 0) {
  logInfo("Submitting " + tasks.size + " missing tasks from " + stage + " (" + stage.rdd + ")")
  stage.pendingPartitions ++= tasks.map(_.partitionId)
  logDebug("New pending partitions: " + stage.pendingPartitions)
  taskScheduler.submitTasks(new TaskSet(
    tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
  stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
}
```

`taskScheduler.submitTasks` submits the tasks.
12. submitTasks()
```scala
override def submitTasks(taskSet: TaskSet) {
  val tasks = taskSet.tasks
  logInfo("Adding task set " + taskSet.id + " with " + tasks.length + " tasks")
  this.synchronized {
    val manager = createTaskSetManager(taskSet, maxTaskFailures)
    val stage = taskSet.stageId
    val stageTaskSets =
      taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new HashMap[Int, TaskSetManager])
    stageTaskSets(taskSet.stageAttemptId) = manager
    val conflictingTaskSet = stageTaskSets.exists { case (_, ts) =>
      ts.taskSet != taskSet && !ts.isZombie
    }
    // ...
```

That concludes the walkthrough.