Spark Core Source Code Reading Notes: The Worker Node Process

Analysis of the Worker Node Process

The main function in Worker

  def main(argStrings: Array[String]) {
    Thread.setDefaultUncaughtExceptionHandler(new SparkUncaughtExceptionHandler(
      exitOnUncaughtException = false))
    Utils.initDaemon(log)
    val conf = new SparkConf
    val args = new WorkerArguments(argStrings, conf)
    val rpcEnv = startRpcEnvAndEndpoint(args.host, args.port, args.webUiPort, args.cores,
      args.memory, args.masters, args.workDir, conf = conf)
    // With external shuffle service enabled, if we request to launch multiple workers on one host,
    // we can only successfully launch the first worker and the rest fails, because with the port
    // bound, we may launch no more than one external shuffle service on each host.
    // When this happens, we should give explicit reason of failure instead of fail silently. For
    // more detail see SPARK-20989.
    val externalShuffleServiceEnabled = conf.getBoolean("spark.shuffle.service.enabled", false)
    val sparkWorkerInstances = scala.sys.env.getOrElse("SPARK_WORKER_INSTANCES", "1").toInt
    require(externalShuffleServiceEnabled == false || sparkWorkerInstances <= 1,
      "Starting multiple workers on one host is failed because we may launch no more than one " +
        "external shuffle service on each host, please set spark.shuffle.service.enabled to " +
        "false or set SPARK_WORKER_INSTANCES to 1 to resolve the conflict.")
    rpcEnv.awaitTermination()
  }

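Worker's main function is almost identical to Master's: both start the RPC environment and then register an Endpoint on it. The one difference is an extra check:

externalShuffleServiceEnabled must be false, or SPARK_WORKER_INSTANCES must be at most 1 (it is treated as 1 when unset)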
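The condition can be isolated into a small function to see which combinations pass. This is a minimal sketch rather than Spark's own code: `WorkerLaunchCheck`, `validate` and `allowed` are hypothetical names, and the two values are passed in as parameters instead of being read from SparkConf and the environment.

```scala
import scala.util.Try

// Hypothetical helper that mirrors the require(...) call in Worker.main.
object WorkerLaunchCheck {
  // Throws IllegalArgumentException when the combination is rejected.
  def validate(externalShuffleServiceEnabled: Boolean, sparkWorkerInstances: Int): Unit =
    require(!externalShuffleServiceEnabled || sparkWorkerInstances <= 1,
      "may launch no more than one external shuffle service on each host")

  // True when a Worker with these settings would be allowed to start.
  def allowed(externalShuffleServiceEnabled: Boolean, sparkWorkerInstances: Int): Boolean =
    Try(validate(externalShuffleServiceEnabled, sparkWorkerInstances)).isSuccess
}
```

With the service disabled any instance count passes; with it enabled, only a count of at most 1 passes.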

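The reason is given in the code comment:

When the external shuffle service is enabled and several Workers are launched on one host, only the first one starts successfully and the rest fail, because the service's port is already bound. At most one external shuffle service can run on each host.

If externalShuffleServiceEnabled is left at its default of false, multiple Workers can be started on the same host.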
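The port conflict behind this rule can be reproduced with plain sockets. The sketch below is only an illustration under stated assumptions: `PortConflictDemo` is a hypothetical name, an ordinary `java.net.ServerSocket` stands in for the shuffle service's listener, and the port is picked by the OS rather than taken from Spark's configuration.

```scala
import java.net.{BindException, InetAddress, ServerSocket}
import scala.util.{Failure, Success, Try}

// Illustration only: plain sockets stand in for the shuffle service's listener.
object PortConflictDemo {
  // Binds one listener, then tries to bind a second one to the same address and port.
  // Returns true when the second bind is rejected with a BindException.
  def secondBindFails(): Boolean = {
    val loopback = InetAddress.getLoopbackAddress
    val first = new ServerSocket(0, 50, loopback) // port 0: let the OS choose a free port
    try {
      Try(new ServerSocket(first.getLocalPort, 50, loopback)) match {
        case Success(second) => second.close(); false
        case Failure(_: BindException) => true
        case Failure(other) => throw other
      }
    } finally first.close()
  }
}
```

The second listener plays the role of the second Worker's shuffle service: while the first one holds the port, the second bind is normally refused, which is the failure the require check reports up front instead of letting it happen silently.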

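The external shuffle service is a service that runs inside the Worker. It serves shuffle files to other Executors on behalf of the Executors that wrote them, which takes the work of serving shuffle data off the Executors so that they can concentrate on running tasks. This helps efficiency.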

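The startRpcEnvAndEndpoint function is also much the same as the one in Master, so it is not analyzed again here.

Next, let's look at what onStart() does in the lifecycle of the Worker Endpoint:

  • Create the work directory: createWorkDir
  • Start the shuffle service: startExternalShuffleService
  • Create and start the Worker's WebUI: WorkerWebUI()
  • Call registerWithMaster to register itself with the Master
  • Start the metricsSystem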
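The order of these steps can be sketched as a tiny class. This is a hypothetical stand-in, not the real Worker: `WorkerLifecycleSketch`, `startWebUi` and `startMetricsSystem` are invented for illustration, and every step only records its name instead of doing the actual work.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for the Worker endpoint; each step only records that it ran.
class WorkerLifecycleSketch {
  val steps: ArrayBuffer[String] = ArrayBuffer.empty[String]

  private def createWorkDir(): Unit = steps += "createWorkDir"
  private def startExternalShuffleService(): Unit = steps += "startExternalShuffleService"
  private def startWebUi(): Unit = steps += "WorkerWebUI"
  private def registerWithMaster(): Unit = steps += "registerWithMaster"
  private def startMetricsSystem(): Unit = steps += "metricsSystem"

  // Invoked once when the endpoint starts; runs the steps in the order listed above.
  def onStart(): Unit = {
    createWorkDir()
    startExternalShuffleService()
    startWebUi()
    registerWithMaster()
    startMetricsSystem()
  }
}
```

Calling `onStart()` on a fresh instance fills `steps` with the five names in list order, which makes the sequence easy to compare against the real method while reading the source.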

Summary

Apart from the check on the external shuffle service at startup, the Worker's startup process is almost the same as the Master's.