Spark Source Code Analysis (1): The yarn-cluster Submission Flow


0 Preface

I have recently been revisiting the source code of the frameworks I studied over the past two years, such as Spark and Flink. I dug out the source I compiled back then and read through it again; this post is a record of that reading, so it is easier to recall later.

The source studied in this post is Spark 2.2.0, with Scala 2.11 and Java 1.8.

1 Let's first look at the script used to submit a Spark job. On YARN there are only two deploy modes, client and cluster; here we focus on submission in cluster mode.

spark-submit \
--master yarn \
--deploy-mode cluster \
--driver-memory 1g \
--executor-memory 2g \
--executor-cores 2 \
--class org.apache.spark.examples.SparkPi \
${SPARK_HOME}/examples/jars/spark-examples.jar


1.1 The spark-submit script

#!/usr/bin/env bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

# disable randomized hash for string in Python 3.3+
export PYTHONHASHSEED=0
# The key line: hand the arguments to org.apache.spark.deploy.SparkSubmit via spark-class
exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"

2 Parsing the arguments and invoking the target class via reflection

2.1 Parsing the arguments

Entering org.apache.spark.deploy.SparkSubmit, its main method first wraps the raw arguments in a SparkSubmitArguments instance (see the code below):

val appArgs = new SparkSubmitArguments(args)

The key step is the parse call made inside SparkSubmitArguments to interpret the options:

try {
  parse(args.asJava)
} catch {
  case e: IllegalArgumentException =>
    SparkSubmit.printErrorAndExit(e.getMessage())
}
// Populate `sparkProperties` map from properties file
mergeDefaultSparkProperties()
// Remove keys that don't start with "spark." from `sparkProperties`.
ignoreNonSparkProperties()
// Use `sparkProperties` map along with env vars to fill in any missing parameters
loadEnvironmentArguments()

validateArguments()
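Conceptually, parse walks the argument list and turns each "--option value" pair into a field on SparkSubmitArguments (the real parser lives in the launcher module's SparkSubmitOptionParser). Below is a minimal sketch of that parsing pattern only; the names parseArgs and opts are illustrative and this is not Spark's implementation.

import scala.annotation.tailrec

object ParseSketch {
  // Fold "--key value" pairs into a map, failing on anything unrecognized.
  @tailrec
  def parseArgs(args: List[String], opts: Map[String, String] = Map.empty): Map[String, String] =
    args match {
      case key :: value :: rest if key.startsWith("--") => parseArgs(rest, opts + (key -> value))
      case Nil => opts
      case bad :: _ => throw new IllegalArgumentException(s"Unrecognized option: $bad")
    }

  def main(args: Array[String]): Unit = {
    val parsed = parseArgs(List("--master", "yarn", "--deploy-mode", "cluster"))
    println(parsed("--deploy-mode")) // prints: cluster
  }
}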

Passing args into the SparkSubmitArguments constructor yields the parsed arguments object; main then pattern matches on its action field to choose what to do next. We focus on the submit branch:

override def main(args: Array[String]): Unit = {
  // Parse the incoming arguments with SparkSubmitArguments
  val appArgs = new SparkSubmitArguments(args)
  if (appArgs.verbose) {
    // scalastyle:off println
    printStream.println(appArgs)
    // scalastyle:on println
  }
  appArgs.action match {
    // TODO this is where the arguments from the spark-submit script arrive
    case SparkSubmitAction.SUBMIT => submit(appArgs)
    case SparkSubmitAction.KILL => kill(appArgs)
    case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
  }
}

Stepping into submit, the launch environment is prepared by prepareSubmitEnvironment:

val (childArgs, childClasspath, sysProps, childMainClass) = prepareSubmitEnvironment(args)

The key is the four values it returns:

val childArgs = new ArrayBuffer[String]() - the arguments for the child main class

val childClasspath = new ArrayBuffer[String]() - the classpath entries

val sysProps = new HashMap[String, String]() - Java system properties

var childMainClass = ""

Of these, childMainClass matters most: it decides which class is used to submit to YARN.

Different deploy modes produce different values of childMainClass. For the cluster mode we care about, it is set to org.apache.spark.deploy.yarn.Client, whose main method is later invoked via reflection; for the SparkPi example above, childArgs ends up as ("--jar", ${SPARK_HOME}/examples/jars/spark-examples.jar, "--class", "org.apache.spark.examples.SparkPi"), as the code below shows.

// In yarn-cluster mode, the child main class is org.apache.spark.deploy.yarn.Client
if (isYarnCluster) {
  childMainClass = "org.apache.spark.deploy.yarn.Client"
  if (args.isPython) {
    childArgs += ("--primary-py-file", args.primaryResource)
    childArgs += ("--class", "org.apache.spark.deploy.PythonRunner")
  } else if (args.isR) {
    val mainFile = new Path(args.primaryResource).getName
    childArgs += ("--primary-r-file", mainFile)
    childArgs += ("--class", "org.apache.spark.deploy.RRunner")
  } else {
    if (args.primaryResource != SparkLauncher.NO_RESOURCE) {
      childArgs += ("--jar", args.primaryResource)
    }
    childArgs += ("--class", args.mainClass)
  }
  if (args.childArgs != null) {
    args.childArgs.foreach { arg => childArgs += ("--arg", arg) }
  }
}

The actual launch is then driven by doRunMain:

if (args.isStandaloneCluster && args.useRest) {
  try {
    // scalastyle:off println
    printStream.println("Running Spark using the REST application submission protocol.")
    // scalastyle:on println
    doRunMain()
  } catch {
    // Fail over to use the legacy submission gateway
    case e: SubmitRestConnectionException =>
      printWarning(s"Master endpoint ${args.master} was not a REST server. " +
        "Falling back to legacy submission gateway instead.")
      args.useRest = false
      submit(args)
  }
// In all other modes, just run the main class as prepared
} else {
  doRunMain()
}

Inside doRunMain, the work is delegated to runMain, which receives the four values returned above:




  def doRunMain(): Unit = {
    if (args.proxyUser != null) {
      val proxyUser = UserGroupInformation.createProxyUser(args.proxyUser,
        UserGroupInformation.getCurrentUser())
      try {
        proxyUser.doAs(new PrivilegedExceptionAction[Unit]() {
          override def run(): Unit = {
            runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
          }
        })
      } catch {
        // (error handling elided)
        case e: Exception => throw e
      }
    } else {
      // Without a proxy user, runMain is called directly with the same four values
      runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
    }
  }

2.2 runMain: the actual reflective invocation

Stepping into runMain, we find the code below: as mentioned above, the main method of org.apache.spark.deploy.yarn.Client is invoked via reflection.


val mainMethod = mainClass.getMethod("main", new Array[String](0).getClass)
if (!Modifier.isStatic(mainMethod.getModifiers)) {
  throw new IllegalStateException("The main method in the given main class must be static")
}

@tailrec
def findCause(t: Throwable): Throwable = t match {
  case e: UndeclaredThrowableException =>
    if (e.getCause() != null) findCause(e.getCause()) else e
  case e: InvocationTargetException =>
    if (e.getCause() != null) findCause(e.getCause()) else e
  case e: Throwable =>
    e
}

try {
  mainMethod.invoke(null, childArgs.toArray)
} catch {
  case t: Throwable =>
    findCause(t) match {
      case SparkUserAppException(exitCode) =>
        System.exit(exitCode)

      case t: Throwable =>
        throw t
    }
}
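The reflective launch itself is nothing Spark-specific: it is the standard Java reflection pattern for invoking a static main method. A minimal, self-contained sketch of that pattern follows; Target and ReflectDemo are illustrative names (assumed to sit in the default package), not Spark classes.

import java.lang.reflect.Modifier

object Target {
  def main(args: Array[String]): Unit = println(s"Hello, ${args.mkString(" ")}")
}

object ReflectDemo {
  def main(args: Array[String]): Unit = {
    // A Scala object exposes main as a static forwarder on the class named "Target".
    val mainClass = Class.forName("Target")
    val mainMethod = mainClass.getMethod("main", classOf[Array[String]])
    require(Modifier.isStatic(mainMethod.getModifiers), "main must be static")
    // Static method, so the receiver is null, just like mainMethod.invoke(null, childArgs.toArray) above
    mainMethod.invoke(null, Array("yarn", "cluster"))
  }
}

As in the Spark code above, the whole Array[String] is passed as the single argument of main, and the receiver is null because the method is static.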

3 Exploring the submission on the YARN client side

3.1 A first look at org.apache.spark.deploy.yarn.Client

Since the reflective call above targets this class's main method, we start there; the key is that it builds a Client and calls its run method:

def main(argStrings: Array[String]) {
  if (!sys.props.contains("SPARK_SUBMIT")) {
    logWarning("WARNING: This client is deprecated and will be removed in a " +
      "future version of Spark. Use ./bin/spark-submit with "--master yarn"")
  }

  // Set an env variable indicating we are running in YARN mode.
  // Note that any env variable with the SPARK_ prefix gets propagated to all (remote) processes
  System.setProperty("SPARK_YARN_MODE", "true")
  val sparkConf = new SparkConf
  // SparkSubmit would use yarn cache to distribute files & jars in yarn mode,
  // so remove them from sparkConf here for yarn mode.
  sparkConf.remove("spark.jars")
  sparkConf.remove("spark.files")
  val args = new ClientArguments(argStrings)
  new Client(args, sparkConf).run()
}

The key call inside run is submitApplication():

def run(): Unit = {
  this.appId = submitApplication()
  if (!launcherBackend.isConnected() && fireAndForget) {
    val report = getApplicationReport(appId)
    val state = report.getYarnApplicationState
    logInfo(s"Application report for $appId (state: $state)")
    logInfo(formatReportDetails(report))
    if (state == YarnApplicationState.FAILED || state == YarnApplicationState.KILLED) {
      throw new SparkException(s"Application $appId finished with status: $state")
    }
  } else {
    val (yarnApplicationState, finalApplicationStatus) = monitorApplication(appId)
    if (yarnApplicationState == YarnApplicationState.FAILED ||
      finalApplicationStatus == FinalApplicationStatus.FAILED) {
      throw new SparkException(s"Application $appId finished with failed status")
    }
    if (yarnApplicationState == YarnApplicationState.KILLED ||
      finalApplicationStatus == FinalApplicationStatus.KILLED) {
      throw new SparkException(s"Application $appId is killed")
    }
    if (finalApplicationStatus == FinalApplicationStatus.UNDEFINED) {
      throw new SparkException(s"The final status of application $appId is undefined")
    }
  }
}

submitApplication initializes and starts the yarnClient, asks the ResourceManager for a new application, and finally returns the appId. The key part of this method is createContainerLaunchContext.

def submitApplication(): ApplicationId = {
  var appId: ApplicationId = null
  try {
    launcherBackend.connect()
    // Setup the credentials before doing anything else,
    // so we have don't have issues at any point.
    setupCredentials()
    // TODO 1. Initialize and start the YARN client
    // Initialize the YARN client
    yarnClient.init(yarnConf)
    // Start the YARN client
    yarnClient.start()

    logInfo("Requesting a new application from cluster with %d NodeManagers"
      .format(yarnClient.getYarnClusterMetrics.getNumNodeManagers))

    // Get a new application from our RM
    // TODO 2. Ask the RM to create a new YARN application and get the appId
    val newApp = yarnClient.createApplication()
    val newAppResponse = newApp.getNewApplicationResponse()
    appId = newAppResponse.getApplicationId()

    new CallerContext("CLIENT", sparkConf.get(APP_CALLER_CONTEXT),
      Option(appId.toString)).setCurrentContext()

    // Verify whether the cluster has enough resources for our AM
    // Check whether the cluster has enough resources
    verifyClusterResources(newAppResponse)

    // Set up the appropriate contexts to launch our AM
    // 3. Create the container launch context for the ApplicationMaster
    val containerContext = createContainerLaunchContext(newAppResponse)
    //
    val appContext = createApplicationSubmissionContext(newApp, containerContext)

    // Finally, submit and monitor the application
    logInfo(s"Submitting application $appId to ResourceManager")
    yarnClient.submitApplication(appContext)
    launcherBackend.setAppId(appId.toString)
    reportLauncherState(SparkAppHandle.State.SUBMITTED)

    appId
  } catch {
    case e: Throwable =>
      if (appId != null) {
        cleanupStagingDir(appId)
      }
      throw e
  }
}
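Stripped of the Spark-specific setup (credentials, staging directory, the real launch context), the YARN side of submitApplication is the standard YarnClient flow from hadoop-yarn-client. Below is a minimal sketch of that bare flow with a placeholder AM command and made-up resource numbers; it is an illustration under those assumptions, not what Spark actually submits.

import org.apache.hadoop.yarn.api.records.{ApplicationSubmissionContext, ContainerLaunchContext, Resource}
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration
import org.apache.hadoop.yarn.util.Records
import scala.collection.JavaConverters._

object RawYarnSubmitSketch {
  def main(args: Array[String]): Unit = {
    val yarnConf = new YarnConfiguration()
    val yarnClient = YarnClient.createYarnClient()
    yarnClient.init(yarnConf)
    yarnClient.start()

    // Ask the RM for a new application (this is where Spark gets its appId)
    val app = yarnClient.createApplication()
    val appContext: ApplicationSubmissionContext = app.getApplicationSubmissionContext

    // A toy AM container; in Spark this is built by createContainerLaunchContext
    val amContainer = Records.newRecord(classOf[ContainerLaunchContext])
    amContainer.setCommands(List("echo hello-am").asJava) // placeholder command

    val capability = Records.newRecord(classOf[Resource])
    capability.setMemory(1024)
    capability.setVirtualCores(1)

    appContext.setApplicationName("raw-yarn-sketch")
    appContext.setAMContainerSpec(amContainer)
    appContext.setResource(capability)

    val appId = yarnClient.submitApplication(appContext)
    println(s"Submitted application $appId")
  }
}

Spark's Client performs these same steps, but fills the ContainerLaunchContext with the java ... ApplicationMaster command assembled in createContainerLaunchContext, which we look at next.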

3.2 Digging into createContainerLaunchContext

This method is long, so we only look at the key parts.

// Build the launch environment
val launchEnv = setupLaunchEnv(appStagingDirPath, pySparkArchives)
val localResources = prepareLocalResources(appStagingDirPath, pySparkArchives)

Next comes the Java option setup, i.e. the JVM configuration:


val javaOpts = ListBuffer[String]()

// Set the environment variable through a command prefix
// to append to the existing value of the variable
var prefixEnv: Option[String] = None
// TODO set the Java/JVM options
// Add Xmx for AM memory
javaOpts += "-Xmx" + amMemory + "m"

val tmpDir = new Path(Environment.PWD.$$(), YarnConfiguration.DEFAULT_CONTAINER_TEMP_DIR)
javaOpts += "-Djava.io.tmpdir=" + tmpDir

// TODO: Remove once cpuset version is pushed out.
// The context is, default gc for server class machines ends up using all cores to do gc -
// hence if there are multiple containers in same node, Spark GC affects all other containers'
// performance (which can be that of other Spark containers)
// Instead of using this, rely on cpusets by YARN to enforce "proper" Spark behavior in
// multi-tenant environments. Not sure how default Java GC behaves if it is limited to subset
// of cores on a node.
val useConcurrentAndIncrementalGC = launchEnv.get("SPARK_USE_CONC_INCR_GC").exists(_.toBoolean)
if (useConcurrentAndIncrementalGC) {
  // In our expts, using (default) throughput collector has severe perf ramifications in
  // multi-tenant machines
  javaOpts += "-XX:+UseConcMarkSweepGC"
  javaOpts += "-XX:MaxTenuringThreshold=31"
  javaOpts += "-XX:SurvivorRatio=8"
  javaOpts += "-XX:+CMSIncrementalMode"
  javaOpts += "-XX:+CMSIncrementalPacing"
  javaOpts += "-XX:CMSIncrementalDutyCycleMin=0"
  javaOpts += "-XX:CMSIncrementalDutyCycle=10"
}

The most critical piece is how amClass is chosen: different deploy modes use different classes for the later reflective launch.

// TODO key point: in cluster mode the AM class is org.apache.spark.deploy.yarn.ApplicationMaster
val amClass =
  if (isClusterMode) {
    Utils.classForName("org.apache.spark.deploy.yarn.ApplicationMaster").getName
  } else {
    Utils.classForName("org.apache.spark.deploy.yarn.ExecutorLauncher").getName
  }

The point here is that the command references the java binary: starting the ApplicationMaster means starting a JVM process. The command is handed to the ResourceManager, which has a NodeManager execute it, thereby launching the main method of the org.apache.spark.deploy.yarn.ApplicationMaster class mentioned above.

// Command for the ApplicationMaster
val commands = prefixEnv ++
  Seq(Environment.JAVA_HOME.$$() + "/bin/java", "-server") ++
  javaOpts ++ amArgs ++
  Seq(
    "1>", ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout",
    "2>", ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr")

4 Exploring org.apache.spark.deploy.yarn.ApplicationMaster

In this class's main method we find master = new ApplicationMaster(amArgs, new YarnRMClient) followed by a call to master.run(); run is where the real work happens:

def main(args: Array[String]): Unit = {
  SignalUtils.registerLogger(log)
  val amArgs = new ApplicationMasterArguments(args)

  // Load the properties file with the Spark configuration and set entries as system properties,
  // so that user code run inside the AM also has access to them.
  // Note: we must do this before SparkHadoopUtil instantiated
  if (amArgs.propertiesFile != null) {
    Utils.getPropertiesFromFile(amArgs.propertiesFile).foreach { case (k, v) =>
      sys.props(k) = v
    }
  }
  // TODO create an ApplicationMaster and call its run method
  SparkHadoopUtil.get.runAsSparkUser { () =>
    master = new ApplicationMaster(amArgs, new YarnRMClient)
    System.exit(master.run())
  }
}

Inside run, after some system properties are set, a shutdown hook is registered, and the security setup is done, the key branch is if (isClusterMode) { runDriver(securityMgr) } else { runExecutorLauncher(securityMgr) }. We follow runDriver.


// This is where the ApplicationMaster does its actual work
final def run(): Int = {
  try {
    val appAttemptId = client.getAttemptId()

    var attemptID: Option[String] = None
    // In cluster mode, set the master, deploy mode, UI port, etc. as system properties
    if (isClusterMode) {
      // Set the web ui port to be ephemeral for yarn so we don't conflict with
      // other spark processes running on the same box
      System.setProperty("spark.ui.port", "0")

      // Set the master and deploy mode property to match the requested mode.
      System.setProperty("spark.master", "yarn")
      System.setProperty("spark.submit.deployMode", "cluster")

      // Set this internal configuration if it is running on cluster mode, this
      // configuration will be checked in SparkContext to avoid misuse of yarn cluster mode.
      System.setProperty("spark.yarn.app.id", appAttemptId.getApplicationId().toString())

      attemptID = Option(appAttemptId.getAttemptId.toString)
    }
    // When the application is submitted, a CallerContext object is created and registered
    // (via reflection) with Hadoop's CallerContext, so rm-audit.log also records the submitting Spark client
    new CallerContext(
      "APPMASTER", sparkConf.get(APP_CALLER_CONTEXT),
      Option(appAttemptId.getApplicationId.toString), attemptID).setCurrentContext()

    logInfo("ApplicationAttemptId: " + appAttemptId)
    // TODO A Java process can die unexpectedly, leaving state that was never saved properly,
    // so cleanup code needs to run when the JVM shuts down.
    // The JDK provides Runtime#addShutdownHook(Thread hook) to register a hook that runs on JVM shutdown.
    // Here one is registered so that cleanup happens when the Spark application exits.
    // This shutdown hook should run *after* the SparkContext is shut down.
    val priority = ShutdownHookManager.SPARK_CONTEXT_SHUTDOWN_PRIORITY - 1
    ShutdownHookManager.addShutdownHook(priority) { () =>
      val maxAppAttempts = client.getMaxRegAttempts(sparkConf, yarnConf)
      val isLastAttempt = client.getAttemptId().getAttemptId() >= maxAppAttempts

      if (!finished) {
        // The default state of ApplicationMaster is failed if it is invoked by shut down hook.
        // This behavior is different compared to 1.x version.
        // If user application is exited ahead of time by calling System.exit(N), here mark
        // this application as failed with EXIT_EARLY. For a good shutdown, user shouldn't call
        // System.exit(0) to terminate the application.
        finish(finalStatus,
          ApplicationMaster.EXIT_EARLY,
          "Shutdown hook called before final status was reported.")
      }

      if (!unregistered) {
        // we only want to unregister if we don't want the RM to retry
        if (finalStatus == FinalApplicationStatus.SUCCEEDED || isLastAttempt) {
          unregister(finalStatus, finalMsg)
          cleanupStagingDir()
        }
      }
    }

    // Call this to force generation of secret so it gets populated into the
    // Hadoop UGI. This has to happen before the startUserApplication which does a
    // doAs in order for the credentials to be passed on to the executor containers.
    val securityMgr = new SecurityManager(sparkConf)
    // Periodically renew security tokens
    // If the credentials file config is present, we must periodically renew tokens. So create
    // a new AMDelegationTokenRenewer
    if (sparkConf.contains(CREDENTIALS_FILE_PATH.key)) {
      // If a principal and keytab have been set, use that to create new credentials for executors
      // periodically
      credentialRenewer =
        new ConfigurableCredentialManager(sparkConf, yarnConf).credentialRenewer()
      credentialRenewer.scheduleLoginFromKeytab()
    }
    // TODO key point: in cluster mode this is where the Driver is started
    if (isClusterMode) {
      runDriver(securityMgr)
    } else {
      runExecutorLauncher(securityMgr)
    }
  } catch {
    case e: Exception =>
      // catch everything else if not specifically handled
      logError("Uncaught exception: ", e)
      finish(FinalApplicationStatus.FAILED,
        ApplicationMaster.EXIT_UNCAUGHT_EXCEPTION,
        "Uncaught exception: " + e)
  }
  exitCode
}

Inside runDriver, startUserApplication is called first to start a thread running the user class (the Driver). The AM thread then blocks, waiting for that thread to initialize the SparkContext so the RPC environment can be obtained:

private def runDriver(securityMgr: SecurityManager): Unit = {
  addAmIpFilter()
  // TODO start the user application in a thread: the Driver side, i.e. the main class that was passed in
  userClassThread = startUserApplication()

  // This a bit hacky, but we need to wait until the spark.driver.port property has
  // been set by the Thread executing the user class.
  logInfo("Waiting for spark context initialization...")
  // wait for the SparkContext to be initialized
  val totalWaitTime = sparkConf.get(AM_MAX_WAIT_TIME)
  try {
    // TODO runDriver must wait here until the user thread has initialized the SparkContext
    val sc = ThreadUtils.awaitResult(sparkContextPromise.future,
      Duration(totalWaitTime, TimeUnit.MILLISECONDS))
    if (sc != null) {
      // the RPC communication environment
      rpcEnv = sc.env.rpcEnv
      val driverRef = runAMEndpoint(
        sc.getConf.get("spark.driver.host"),
        sc.getConf.get("spark.driver.port"),
        isClusterMode = true)
      // TODO with the driver's RPC endpoint available, register the AM with the ResourceManager and request resources
      registerAM(sc.getConf, rpcEnv, driverRef, sc.ui.map(_.webUrl), securityMgr)
    } else {
      // Sanity check; should never happen in normal operation, since sc should only be null
      // if the user app did not create a SparkContext.
      if (!finished) {
        throw new IllegalStateException("SparkContext is null but app is still running!")
      }
    }
    userClassThread.join()
  } catch {
    case e: SparkException if e.getCause().isInstanceOf[TimeoutException] =>
      logError(
        s"SparkContext did not initialize after waiting for $totalWaitTime ms. " +
         "Please check earlier log output for errors. Failing the application.")
      finish(FinalApplicationStatus.FAILED,
        ApplicationMaster.EXIT_SC_NOT_INITED,
        "Timed out waiting for SparkContext.")
  }
}
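The wait on sparkContextPromise is the ordinary Promise/Await handshake from the Scala standard library (ThreadUtils.awaitResult is essentially a wrapper around Await.result). Below is a minimal self-contained sketch of the same handshake; contextPromise and userThread are illustrative stand-ins, not Spark names.

import java.util.concurrent.TimeUnit
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration.Duration

object PromiseHandshakeSketch {
  def main(args: Array[String]): Unit = {
    // Stand-in for sparkContextPromise: the user thread completes it once its "SparkContext" is ready.
    val contextPromise = Promise[String]()

    val userThread = new Thread(new Runnable {
      override def run(): Unit = {
        Thread.sleep(500)                                // pretend initialization work
        contextPromise.trySuccess("initialized-context") // like sparkContextPromise.trySuccess(sc)
      }
    })
    userThread.setName("Driver")
    userThread.start()

    // Stand-in for ThreadUtils.awaitResult(sparkContextPromise.future, AM_MAX_WAIT_TIME)
    val ctx = Await.result(contextPromise.future, Duration(10, TimeUnit.SECONDS))
    println(s"Got $ctx, now the AM would register with the RM")
    userThread.join()
  }
}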

Stepping into startUserApplication, we can see that it simply starts a thread that runs the main method of the user class we passed in, e.g. org.apache.spark.examples.SparkPi:

val mainMethod = userClassLoader.loadClass(args.userClass)
  .getMethod("main", classOf[Array[String]])
// TODO a dedicated thread is started to run the user class
val userThread = new Thread {
  override def run() {
    try {
      mainMethod.invoke(null, userArgs.toArray)
      finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS)
      logDebug("Done running users class")
    } catch {
      case e: InvocationTargetException =>
        e.getCause match {
          case _: InterruptedException =>
            // Reporter thread can interrupt to stop user class
          case SparkUserAppException(exitCode) =>
            val msg = s"User application exited with status $exitCode"
            logError(msg)
            finish(FinalApplicationStatus.FAILED, exitCode, msg)
          case cause: Throwable =>
            logError("User class threw exception: " + cause, cause)
            finish(FinalApplicationStatus.FAILED,
              ApplicationMaster.EXIT_EXCEPTION_USER_CLASS,
              "User class threw exception: " + cause)
        }
        sparkContextPromise.tryFailure(e.getCause())
    } finally {
      // Notify the thread waiting for the SparkContext, in case the application did not
      // instantiate one. This will do nothing when the user code instantiates a SparkContext
      // (with the correct master), or when the user code throws an exception (due to the
      // tryFailure above).
      sparkContextPromise.trySuccess(null)
    }
  }
}
userThread.setContextClassLoader(userClassLoader)
userThread.setName("Driver")
userThread.start()
userThread

Back in runDriver, once the RPC environment is ready, the AM is registered and resources are requested; the key method is registerAM:

if (sc != null) {
  // the RPC communication environment
  rpcEnv = sc.env.rpcEnv
  val driverRef = runAMEndpoint(
    sc.getConf.get("spark.driver.host"),
    sc.getConf.get("spark.driver.port"),
    isClusterMode = true)
  // TODO with the driver's RPC endpoint available, register the AM with the ResourceManager and request resources
  registerAM(sc.getConf, rpcEnv, driverRef, sc.ui.map(_.webUrl), securityMgr)
}

Inside registerAM, after registering with YARN through the client, allocateResources is called to obtain the containers:

// the key registration: registerAM
private def registerAM(
    _sparkConf: SparkConf,
    _rpcEnv: RpcEnv,
    driverRef: RpcEndpointRef,
    uiAddress: Option[String],
    securityMgr: SecurityManager) = {
  val appId = client.getAttemptId().getApplicationId().toString()
  val attemptId = client.getAttemptId().getAttemptId().toString()
  val historyAddress =
    _sparkConf.get(HISTORY_SERVER_ADDRESS)
      .map { text => SparkHadoopUtil.get.substituteHadoopVariables(text, yarnConf) }
      .map { address => s"${address}${HistoryServer.UI_PATH_PREFIX}/${appId}/${attemptId}" }
      .getOrElse("")

  val driverUrl = RpcEndpointAddress(
    _sparkConf.get("spark.driver.host"),
    _sparkConf.get("spark.driver.port").toInt,
    CoarseGrainedSchedulerBackend.ENDPOINT_NAME).toString

  // Before we initialize the allocator, let's log the information about how executors will
  // be run up front, to avoid printing this out for every single executor being launched.
  // Use placeholders for information that changes such as executor IDs.
  logInfo {
    val executorMemory = sparkConf.get(EXECUTOR_MEMORY).toInt
    val executorCores = sparkConf.get(EXECUTOR_CORES)
    val dummyRunner = new ExecutorRunnable(None, yarnConf, sparkConf, driverUrl, "<executorId>",
      "<hostname>", executorMemory, executorCores, appId, securityMgr, localResources)
    dummyRunner.launchContextDebugInfo()
  }
  // register via the YARN client
  allocator = client.register(driverUrl,
    driverRef,
    yarnConf,
    _sparkConf,
    uiAddress,
    historyAddress,
    securityMgr,
    localResources)

  // the allocator obtains resources from YARN
  allocator.allocateResources()
  reporterThread = launchReporterThread()
}

Allocating resources also involves deciding what to do with each allocated container. handleAllocatedContainers (shown below after allocateResources) classifies them: first containers on a host we requested, then containers on the same rack, and finally the rest, which are released. Once classification is done, runAllocatedContainers is called on the containers we keep:

def allocateResources(): Unit = synchronized {
  updateResourceRequests()

  val progressIndicator = 0.1f
  // Poll the ResourceManager. This doubles as a heartbeat if there are no pending container
  // requests.
  val allocateResponse = amClient.allocate(progressIndicator)

  val allocatedContainers = allocateResponse.getAllocatedContainers()
  // if containers were allocated, handle them
  if (allocatedContainers.size > 0) {
    logDebug("Allocated containers: %d. Current executor count: %d. Cluster resources: %s."
      .format(
        allocatedContainers.size,
        numExecutorsRunning,
        allocateResponse.getAvailableResources))

    handleAllocatedContainers(allocatedContainers.asScala)
  }

  val completedContainers = allocateResponse.getCompletedContainersStatuses()
  if (completedContainers.size > 0) {
    logDebug("Completed %d containers".format(completedContainers.size))
    processCompletedContainers(completedContainers.asScala)
    logDebug("Finished processing %d completed containers. Current running executor count: %d."
      .format(completedContainers.size, numExecutorsRunning))
  }
}
def handleAllocatedContainers(allocatedContainers: Seq[Container]): Unit = {
  val containersToUse = new ArrayBuffer[Container](allocatedContainers.size)

  // Match incoming requests by host (containers on a requested node)
  val remainingAfterHostMatches = new ArrayBuffer[Container]
  for (allocatedContainer <- allocatedContainers) {
    // classify by node (and later by rack): matched containers go into containersToUse, the rest into remainingAfterHostMatches
    matchContainerToRequest(allocatedContainer, allocatedContainer.getNodeId.getHost,
      containersToUse, remainingAfterHostMatches)
  }

  // Match remaining by rack
  val remainingAfterRackMatches = new ArrayBuffer[Container]
  for (allocatedContainer <- remainingAfterHostMatches) {
    val rack = resolver.resolve(conf, allocatedContainer.getNodeId.getHost)
    matchContainerToRequest(allocatedContainer, rack, containersToUse,
      remainingAfterRackMatches)
  }

  // Assign remaining that are neither node-local nor rack-local
  val remainingAfterOffRackMatches = new ArrayBuffer[Container]
  for (allocatedContainer <- remainingAfterRackMatches) {
    matchContainerToRequest(allocatedContainer, ANY_HOST, containersToUse,
      remainingAfterOffRackMatches)
  }
  // release the leftover containers that match nothing
  if (!remainingAfterOffRackMatches.isEmpty) {
    logDebug(s"Releasing ${remainingAfterOffRackMatches.size} unneeded containers that were " +
      s"allocated to us")
    for (container <- remainingAfterOffRackMatches) {
      internalReleaseContainer(container)
    }
  }
  // launch executors on all the containers we kept
  runAllocatedContainers(containersToUse)

  logInfo("Received %d containers from YARN, launching executors on %d of them."
    .format(allocatedContainers.size, containersToUse.size))
}
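The host-then-rack-then-any matching above is a common locality-preference pattern. Here is a small generic sketch of the same bucketing idea; it is not Spark code, and the Container case class and bucket function are purely illustrative.

object LocalityBucketSketch {
  final case class Container(id: Int, host: String, rack: String)

  // Partition containers into (use, release) given the hosts and racks we asked for.
  def bucket(allocated: Seq[Container],
             wantedHosts: Set[String],
             wantedRacks: Set[String]): (Seq[Container], Seq[Container]) = {
    val (hostLocal, rest1) = allocated.partition(c => wantedHosts.contains(c.host))
    val (rackLocal, rest2) = rest1.partition(c => wantedRacks.contains(c.rack))
    // Spark also keeps "any host" matches for outstanding requests; here we simply release the leftovers.
    (hostLocal ++ rackLocal, rest2)
  }

  def main(args: Array[String]): Unit = {
    val allocated = Seq(Container(1, "node-a", "rack-1"), Container(2, "node-z", "rack-1"),
                        Container(3, "node-q", "rack-9"))
    val (use, release) = bucket(allocated, wantedHosts = Set("node-a"), wantedRacks = Set("rack-1"))
    println(s"use=${use.map(_.id)}, release=${release.map(_.id)}")
  }
}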

runAllocatedContainers(containersToUse) then launches the containers; internally it submits each ExecutorRunnable.run to a thread pool so they execute asynchronously:

private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): Unit = {
  for (container <- containersToUse) {
    executorIdCounter += 1
    val executorHostname = container.getNodeId.getHost
    val containerId = container.getId
    val executorId = executorIdCounter.toString
    assert(container.getResource.getMemory >= resource.getMemory)
    logInfo(s"Launching container $containerId on host $executorHostname " +
      s"for executor with ID $executorId")

    def updateInternalState(): Unit = synchronized {
      numExecutorsRunning += 1
      executorIdToContainer(executorId) = container
      containerIdToExecutorId(container.getId) = executorId

      val containerSet = allocatedHostToContainersMap.getOrElseUpdate(executorHostname,
        new HashSet[ContainerId])
      containerSet += containerId
      allocatedContainerToHostMap.put(containerId, executorHostname)
    }

    if (numExecutorsRunning < targetNumExecutors) {
      if (launchContainers) {
        launcherPool.execute(new Runnable {
          override def run(): Unit = {
            try {
              new ExecutorRunnable(
                Some(container),
                conf,
                sparkConf,
                driverUrl,
                executorId,
                executorHostname,
                executorMemory,
                executorCores,
                appAttemptId.getApplicationId.toString,
                securityMgr,
                localResources
              ).run()
              updateInternalState()
            } catch {
              case NonFatal(e) =>
                logError(s"Failed to launch executor $executorId on container $containerId", e)
                // Assigned container should be released immediately to avoid unnecessary resource
                // occupation.
                amClient.releaseAssignedContainer(containerId)
            }
          }
        })
      } else {
        // For test only
        updateInternalState()
      }
    } else {
      logInfo(("Skip launching executorRunnable as runnning Excecutors count: %d " +
        "reached target Executors count: %d.").format(numExecutorsRunning, targetNumExecutors))
    }
  }
}
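The launch itself is plain java.util.concurrent usage: each ExecutorRunnable.run is handed to a thread pool so containers start in parallel. A small self-contained sketch of that pattern follows; launcherPool and launchExecutor are illustrative stand-ins, and Spark sizes its real pool from a configuration value.

import java.util.concurrent.{Executors, TimeUnit}

object LauncherPoolSketch {
  def main(args: Array[String]): Unit = {
    // Stand-in for launcherPool
    val launcherPool = Executors.newFixedThreadPool(4)

    def launchExecutor(id: Int): Unit =
      println(s"[${Thread.currentThread().getName}] starting container for executor $id")

    (1 to 8).foreach { id =>
      launcherPool.execute(new Runnable {
        override def run(): Unit =
          try launchExecutor(id) // like new ExecutorRunnable(...).run()
          catch { case e: Exception => println(s"Failed to launch executor $id: $e") }
      })
    }

    launcherPool.shutdown()
    launcherPool.awaitTermination(30, TimeUnit.SECONDS)
  }
}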

ExecutorRunnable.run creates a NodeManager client (NMClient), initializes and starts it, and then starts the container:

def run(): Unit = {
  logDebug("Starting Executor Container")
  nmClient = NMClient.createNMClient()
  nmClient.init(conf)
  nmClient.start()
  startContainer()
}

The key is the startContainer method:

def startContainer(): java.util.Map[String, ByteBuffer] = {
  val ctx = Records.newRecord(classOf[ContainerLaunchContext])
    .asInstanceOf[ContainerLaunchContext]
  val env = prepareEnvironment().asJava

  ctx.setLocalResources(localResources.asJava)
  ctx.setEnvironment(env)

  val credentials = UserGroupInformation.getCurrentUser().getCredentials()
  val dob = new DataOutputBuffer()
  credentials.writeTokenStorageToStream(dob)
  ctx.setTokens(ByteBuffer.wrap(dob.getData()))

  val commands = prepareCommand()

  ctx.setCommands(commands.asJava)
  ctx.setApplicationACLs(
    YarnSparkHadoopUtil.getApplicationAclsForYarn(securityMgr).asJava)

  // If external shuffle service is enabled, register with the Yarn shuffle service already
  // started on the NodeManager and, if authentication is enabled, provide it with our secret
  // key for fetching shuffle files later
  if (sparkConf.get(SHUFFLE_SERVICE_ENABLED)) {
    val secretString = securityMgr.getSecretKey()
    val secretBytes =
      if (secretString != null) {
        // This conversion must match how the YarnShuffleService decodes our secret
        JavaUtils.stringToBytes(secretString)
      } else {
        // Authentication is not enabled, so just provide dummy metadata
        ByteBuffer.allocate(0)
      }
    ctx.setServiceData(Collections.singletonMap("spark_shuffle", secretBytes))
  }

  // Send the start request to the ContainerManager
  try {
    nmClient.startContainer(container.get, ctx)
  } catch {
    case ex: Exception =>
      throw new SparkException(s"Exception while starting container ${container.get.getId}" +
        s" on host $hostname", ex)
  }
}

The crucial line here is val commands = prepareCommand(). Stepping into it, we find that it again builds a command for a JVM process, whose main class is shown below:

private def prepareCommand(): List[String] = {
  // Extra options for the JVM
  val javaOpts = ListBuffer[String]()

  // Set the environment variable through a command prefix
  // to append to the existing value of the variable
  var prefixEnv: Option[String] = None

  // Set the JVM memory
  val executorMemoryString = executorMemory + "m"
  javaOpts += "-Xmx" + executorMemoryString

  // Set extra Java options for the executor, if defined
  sparkConf.get(EXECUTOR_JAVA_OPTIONS).foreach { opts =>
    javaOpts ++= Utils.splitCommandString(opts).map(YarnSparkHadoopUtil.escapeForShell)
  }
  sparkConf.get(EXECUTOR_LIBRARY_PATH).foreach { p =>
    prefixEnv = Some(Client.getClusterPath(sparkConf, Utils.libraryPathEnvPrefix(Seq(p))))
  }

  javaOpts += "-Djava.io.tmpdir=" +
    new Path(Environment.PWD.$$(), YarnConfiguration.DEFAULT_CONTAINER_TEMP_DIR)

  // Certain configs need to be passed here because they are needed before the Executor
  // registers with the Scheduler and transfers the spark configs. Since the Executor backend
  // uses RPC to connect to the scheduler, the RPC settings are needed as well as the
  // authentication settings.
  sparkConf.getAll
    .filter { case (k, v) => SparkConf.isExecutorStartupConf(k) }
    .foreach { case (k, v) => javaOpts += YarnSparkHadoopUtil.escapeForShell(s"-D$k=$v") }

  // Commenting it out for now - so that people can refer to the properties if required. Remove
  // it once cpuset version is pushed out.
  // The context is, default gc for server class machines end up using all cores to do gc - hence
  // if there are multiple containers in same node, spark gc effects all other containers
  // performance (which can also be other spark containers)
  // Instead of using this, rely on cpusets by YARN to enforce spark behaves 'properly' in
  // multi-tenant environments. Not sure how default java gc behaves if it is limited to subset
  // of cores on a node.
  /*
      else {
        // If no java_opts specified, default to using -XX:+CMSIncrementalMode
        // It might be possible that other modes/config is being done in
        // spark.executor.extraJavaOptions, so we don't want to mess with it.
        // In our expts, using (default) throughput collector has severe perf ramifications in
        // multi-tenant machines
        // The options are based on
        // http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html#0.0.0.%20When%20to%20Use
        // %20the%20Concurrent%20Low%20Pause%20Collector|outline
        javaOpts += "-XX:+UseConcMarkSweepGC"
        javaOpts += "-XX:+CMSIncrementalMode"
        javaOpts += "-XX:+CMSIncrementalPacing"
        javaOpts += "-XX:CMSIncrementalDutyCycleMin=0"
        javaOpts += "-XX:CMSIncrementalDutyCycle=10"
      }
  */

  // For log4j configuration to reference
  javaOpts += ("-Dspark.yarn.app.container.log.dir=" + ApplicationConstants.LOG_DIR_EXPANSION_VAR)

  val userClassPath = Client.getUserClasspath(sparkConf).flatMap { uri =>
    val absPath =
      if (new File(uri.getPath()).isAbsolute()) {
        Client.getClusterPath(sparkConf, uri.getPath())
      } else {
        Client.buildPath(Environment.PWD.$(), uri.getPath())
      }
    Seq("--user-class-path", "file:" + absPath)
  }.toSeq

  YarnSparkHadoopUtil.addOutOfMemoryErrorArgument(javaOpts)
  val commands = prefixEnv ++
    Seq(Environment.JAVA_HOME.$$() + "/bin/java", "-server") ++
    javaOpts ++
    Seq("org.apache.spark.executor.CoarseGrainedExecutorBackend",
      "--driver-url", masterAddress,
      "--executor-id", executorId,
      "--hostname", hostname,
      "--cores", executorCores.toString,
      "--app-id", appId) ++
    userClassPath ++
    Seq(
      s"1>${ApplicationConstants.LOG_DIR_EXPANSION_VAR}/stdout",
      s"2>${ApplicationConstants.LOG_DIR_EXPANSION_VAR}/stderr")

  // TODO: it would be nicer to just make sure there are no null commands here
  commands.map(s => if (s == null) "null" else s).toList
}

The key here is org.apache.spark.executor.CoarseGrainedExecutorBackend:

val commands = prefixEnv ++
  Seq(Environment.JAVA_HOME.$$() + "/bin/java", "-server") ++
  javaOpts ++
  Seq("org.apache.spark.executor.CoarseGrainedExecutorBackend",
    "--driver-url", masterAddress,
    "--executor-id", executorId,
    "--hostname", hostname,
    "--cores", executorCores.toString,
    "--app-id", appId) ++
  userClassPath ++
  Seq(
    s"1>${ApplicationConstants.LOG_DIR_EXPANSION_VAR}/stdout",
    s"2>${ApplicationConstants.LOG_DIR_EXPANSION_VAR}/stderr")

// TODO: it would be nicer to just make sure there are no null commands here
commands.map(s => if (s == null) "null" else s).toList

So what each container executes is the main method of org.apache.spark.executor.CoarseGrainedExecutorBackend.

5 A first look at org.apache.spark.executor.CoarseGrainedExecutorBackend

Let's first look at the main method:

def main(args: Array[String]) {
  var driverUrl: String = null
  var executorId: String = null
  var hostname: String = null
  var cores: Int = 0
  var appId: String = null
  var workerUrl: Option[String] = None
  val userClassPath = new mutable.ListBuffer[URL]()

  var argv = args.toList
  while (!argv.isEmpty) {
    argv match {
      case ("--driver-url") :: value :: tail =>
        driverUrl = value
        argv = tail
      case ("--executor-id") :: value :: tail =>
        executorId = value
        argv = tail
      case ("--hostname") :: value :: tail =>
        hostname = value
        argv = tail
      case ("--cores") :: value :: tail =>
        cores = value.toInt
        argv = tail
      case ("--app-id") :: value :: tail =>
        appId = value
        argv = tail
      case ("--worker-url") :: value :: tail =>
        // Worker url is used in spark standalone mode to enforce fate-sharing with worker
        workerUrl = Some(value)
        argv = tail
      case ("--user-class-path") :: value :: tail =>
        userClassPath += new URL(value)
        argv = tail
      case Nil =>
      case tail =>
        // scalastyle:off println
        System.err.println(s"Unrecognized options: ${tail.mkString(" ")}")
        // scalastyle:on println
        printUsageAndExit()
    }
  }

  if (driverUrl == null || executorId == null || hostname == null || cores <= 0 ||
    appId == null) {
    printUsageAndExit()
  }

  run(driverUrl, executorId, hostname, cores, appId, workerUrl, userClassPath)
  System.exit(0)
}

The first part is just argument parsing; the key is the run method, so let's step into it:

private def run(
    driverUrl: String,
    executorId: String,
    hostname: String,
    cores: Int,
    appId: String,
    workerUrl: Option[String],
    userClassPath: Seq[URL]) {

  Utils.initDaemon(log)

  SparkHadoopUtil.get.runAsSparkUser { () =>
    // Debug code
    Utils.checkHost(hostname)

    // Bootstrap to fetch the driver's Spark properties.
    val executorConf = new SparkConf
    val port = executorConf.getInt("spark.executor.port", 0)
    val fetcher = RpcEnv.create(
      "driverPropsFetcher",
      hostname,
      port,
      executorConf,
      new SecurityManager(executorConf),
      clientMode = true)
    val driver = fetcher.setupEndpointRefByURI(driverUrl)
    val cfg = driver.askSync[SparkAppConfig](RetrieveSparkAppConfig)
    val props = cfg.sparkProperties ++ Seq[(String, String)](("spark.app.id", appId))
    fetcher.shutdown()

    // Create SparkEnv using properties we fetched from the driver.
    val driverConf = new SparkConf()
    for ((key, value) <- props) {
      // this is required for SSL in standalone mode
      if (SparkConf.isExecutorStartupConf(key)) {
        driverConf.setIfMissing(key, value)
      } else {
        driverConf.set(key, value)
      }
    }
    if (driverConf.contains("spark.yarn.credentials.file")) {
      logInfo("Will periodically update credentials from: " +
        driverConf.get("spark.yarn.credentials.file"))
      SparkHadoopUtil.get.startCredentialUpdater(driverConf)
    }

    val env = SparkEnv.createExecutorEnv(
      driverConf, executorId, hostname, port, cores, cfg.ioEncryptionKey, isLocal = false)

    env.rpcEnv.setupEndpoint("Executor", new CoarseGrainedExecutorBackend(
      env.rpcEnv, driverUrl, executorId, hostname, cores, userClassPath, env))
    workerUrl.foreach { url =>
      env.rpcEnv.setupEndpoint("WorkerWatcher", new WorkerWatcher(env.rpcEnv, url))
    }
    env.rpcEnv.awaitTermination()
    SparkHadoopUtil.get.stopCredentialUpdater()
  }
}

The key logic here is the creation of the executor's SparkEnv and of its communication endpoint: setupEndpoint registers a CoarseGrainedExecutorBackend endpoint named "Executor", through which the backend registers itself with the Driver and then receives task-launch messages.

6 Overall flow

The two diagrams referenced here (taken from the internet; they will be removed on request) show the overall flow and the actual process. In words:

1 Running the submit script actually starts a SparkSubmit JVM process;

2 The main method of SparkSubmit reflectively invokes the main method of org.apache.spark.deploy.yarn.Client;

3 Client creates a YARN client and sends YARN a launch command of the form bin/java ... ApplicationMaster;

4 On receiving the command, YARN starts the ApplicationMaster on a chosen NodeManager;

5 The ApplicationMaster starts the Driver thread, which runs the user's job;

6 The AM registers with the RM and requests resources;

7 Once resources are granted, the AM sends the NodeManagers a command of the form bin/java ... CoarseGrainedExecutorBackend;

8 Each CoarseGrainedExecutorBackend process communicates with the Driver, registers the newly started executor, then creates the Executor object and waits for tasks;

9 The Driver thread carries on with job scheduling and task execution;

10 The Driver assigns tasks and monitors their execution.