Spark Core Source Code Reading Notes: Parsing the Startup Scripts


Parsing the Startup Scripts

Under SPARK_HOME/sbin there are several important startup scripts:

  • spark-config.sh
  • spark-daemon.sh
  • start-all.sh
  • start-master.sh
  • start-slave.sh
  • stop-xxx.sh
  • ....

Every script starting with start- is one of the scripts users commonly run to launch and manage Spark components.

In the spirit of studying the source code, the focus here is on what start-master.sh actually does; start-slave.sh will not be covered in detail.

start-master.sh

#!/usr/bin/env bash

# Starts the master on the machine this script is executed on.

if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi

# NOTE: This exact class name is matched downstream by SparkSubmit.
# Any changes need to be reflected there.
CLASS="org.apache.spark.deploy.master.Master"

if [[ "$@" = *--help ]] || [[ "$@" = *-h ]]; then
  echo "Usage: ./sbin/start-master.sh [options]"
  pattern="Usage:"
  pattern+="\|Using Spark's default log4j profile:"
  pattern+="\|Registered signal handlers for"

  "${SPARK_HOME}"/bin/spark-class $CLASS --help 2>&1 | grep -v "$pattern" 1>&2
  exit 1
fi

ORIGINAL_ARGS="$@"

. "${SPARK_HOME}/sbin/spark-config.sh"

. "${SPARK_HOME}/bin/load-spark-env.sh"

if [ "$SPARK_MASTER_PORT" = "" ]; then
  SPARK_MASTER_PORT=7077
fi

if [ "$SPARK_MASTER_HOST" = "" ]; then
  case `uname` in
      (SunOS)
	  SPARK_MASTER_HOST="`/usr/sbin/check-hostname | awk '{print $NF}'`"
	  ;;
      (*)
	  SPARK_MASTER_HOST="`hostname -f`"
	  ;;
  esac
fi

if [ "$SPARK_MASTER_WEBUI_PORT" = "" ]; then
  SPARK_MASTER_WEBUI_PORT=8080
fi
echo $0 'will ultimately run the following command:'
echo "-------------------------"
echo ${SPARK_HOME}/sbin/spark-daemon.sh  start  $CLASS  1  --host $SPARK_MASTER_HOST --port $SPARK_MASTER_PORT --webui-port $SPARK_MASTER_WEBUI_PORT $ORIGINAL_ARGS
echo "-------------------------"
echo ""
echo ""

"${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 \
  --host $SPARK_MASTER_HOST --port $SPARK_MASTER_PORT --webui-port $SPARK_MASTER_WEBUI_PORT \
  $ORIGINAL_ARGS

By adding extra print statements to the shell script, we can see that start-master.sh mainly does the following:

  • Sources sbin/spark-config.sh and bin/load-spark-env.sh to set up some basic environment variables
  • Sets variables such as SPARK_MASTER_PORT, SPARK_MASTER_HOST, and SPARK_MASTER_WEBUI_PORT
  • Calls sbin/spark-daemon.sh to run the main method of the org.apache.spark.deploy.master.Master class
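
These defaults (port 7077, hostname -f, web UI port 8080) only apply when the corresponding variables are empty, so they can be overridden from the environment before invoking the script; a quick example (the host name below is just a placeholder):

# Override the defaults from the environment (placeholder host name):
SPARK_MASTER_HOST=my-master-node \
SPARK_MASTER_PORT=7077 \
SPARK_MASTER_WEBUI_PORT=8081 \
"${SPARK_HOME}/sbin/start-master.sh"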

Next, let's take a quick look at spark-daemon.sh.

spark-daemon.sh

The script is shown below (extra print statements have been added; reading the original Spark code alongside it is recommended).

#!/usr/bin/env bash

# Runs a Spark command as a daemon.
#
# Environment Variables
#
#   SPARK_CONF_DIR  Alternate conf dir. Default is ${SPARK_HOME}/conf.
#   SPARK_LOG_DIR   Where log files are stored. ${SPARK_HOME}/logs by default.
#   SPARK_MASTER    host:path where spark code should be rsync'd from
#   SPARK_PID_DIR   The pid files are stored. /tmp by default.
#   SPARK_IDENT_STRING   A string representing this instance of spark. $USER by default
#   SPARK_NICENESS The scheduling priority for daemons. Defaults to 0.
#   SPARK_NO_DAEMONIZE   If set, will run the proposed command in the foreground. It will not output a PID file.
##
echo ""
echo $0 "被调用"
usage="Usage: spark-daemon.sh [--config <conf-dir>] (start|stop|submit|status) <spark-command> <spark-instance-number> <args...>"

# if no args specified, show usage
if [ $# -le 1 ]; then
  echo $usage
  exit 1
fi

if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi

. "${SPARK_HOME}/sbin/spark-config.sh"

# get arguments

# Check if --config is passed as an argument. It is an optional parameter.
# Exit if the argument is not a directory.

if [ "$1" == "--config" ]
then
  shift
  conf_dir="$1"
  if [ ! -d "$conf_dir" ]
  then
    echo "ERROR : $conf_dir is not a directory"
    echo $usage
    exit 1
  else
    export SPARK_CONF_DIR="$conf_dir"
  fi
  shift
fi

option=$1
shift
command=$1
shift
instance=$1
shift
echo "option:$option"
echo "command:$command"
echo "instance:$instance"
echo "参数:$@"

spark_rotate_log ()
{
    log=$1;
    num=5;
    if [ -n "$2" ]; then
	num=$2
    fi
    if [ -f "$log" ]; then # rotate logs
	while [ $num -gt 1 ]; do
	    prev=`expr $num - 1`
	    [ -f "$log.$prev" ] && mv "$log.$prev" "$log.$num"
	    num=$prev
	done
	mv "$log" "$log.$num";
    fi
}

. "${SPARK_HOME}/bin/load-spark-env.sh"

if [ "$SPARK_IDENT_STRING" = "" ]; then
  export SPARK_IDENT_STRING="$USER"
fi


export SPARK_PRINT_LAUNCH_COMMAND="1"

# get log directory
if [ "$SPARK_LOG_DIR" = "" ]; then
  export SPARK_LOG_DIR="${SPARK_HOME}/logs"
fi
mkdir -p "$SPARK_LOG_DIR"
touch "$SPARK_LOG_DIR"/.spark_test > /dev/null 2>&1
TEST_LOG_DIR=$?
if [ "${TEST_LOG_DIR}" = "0" ]; then
  rm -f "$SPARK_LOG_DIR"/.spark_test
else
  chown "$SPARK_IDENT_STRING" "$SPARK_LOG_DIR"
fi

if [ "$SPARK_PID_DIR" = "" ]; then
  SPARK_PID_DIR=/tmp
fi

# some variables
log="$SPARK_LOG_DIR/spark-$SPARK_IDENT_STRING-$command-$instance-$HOSTNAME.out"
pid="$SPARK_PID_DIR/spark-$SPARK_IDENT_STRING-$command-$instance.pid"

# Set default scheduling priority
if [ "$SPARK_NICENESS" = "" ]; then
    export SPARK_NICENESS=0
fi

execute_command() {
  echo ""

  if [ -z ${SPARK_NO_DAEMONIZE+set} ]; then
      echo "原本后台执行命令"
      echo "-------------------------"
      echo "nohup -- "$@" >> $log 2>&1 < /dev/null &"
      # nohup -- "$@" >> $log 2>&1 < /dev/null &
      echo "-------------------------"
      "$@" &
      newpid="$!"
      echo "newpid:$newpid"

      echo "$newpid" > "$pid"

      # Poll for up to 5 seconds for the java process to start
      for i in {1..10}
      do
        if [[ $(ps -p "$newpid" -o comm=) =~ "java" ]]; then
           break
        fi
        sleep 0.5
      done

      sleep 2
      # Check if the process has died; in that case we'll tail the log so the user can see
      if [[ ! $(ps -p "$newpid" -o comm=) =~ "java" ]]; then
        echo "failed to launch: $@"
        tail -10 "$log" | sed 's/^/  /'
        echo "full log in $log"
      fi
  else
      "$@"
  fi
}

run_command() {
  mode="$1"
  shift

  mkdir -p "$SPARK_PID_DIR"

  if [ -f "$pid" ]; then
    TARGET_ID="$(cat "$pid")"
    if [[ $(ps -p "$TARGET_ID" -o comm=) =~ "java" ]]; then
      echo "$command running as process $TARGET_ID.  Stop it first."
      exit 1
    fi
  fi

  if [ "$SPARK_MASTER" != "" ]; then
    echo rsync from "$SPARK_MASTER"
    rsync -a -e ssh --delete --exclude=.svn --exclude='logs/*' --exclude='contrib/hod/logs/*' "$SPARK_MASTER/" "${SPARK_HOME}"
  fi

  spark_rotate_log "$log"
  echo "starting $command, logging to $log"

  echo ""
  echo "->>mode为:$mode"
  case "$mode" in
    (class)
     
      echo "-------------------------"
      echo execute_command nice -n "$SPARK_NICENESS" "${SPARK_HOME}"/bin/spark-class "$command" "$@"
      execute_command nice -n "$SPARK_NICENESS" "${SPARK_HOME}"/bin/spark-class "$command" "$@"
      echo "-------------------------"
      ;;

    (submit)
      echo "-------------------------"
      echo execute_command nice -n "$SPARK_NICENESS" bash "${SPARK_HOME}"/bin/spark-submit --class "$command" "$@"
      execute_command nice -n "$SPARK_NICENESS" bash "${SPARK_HOME}"/bin/spark-submit --class "$command" "$@"
      echo "-------------------------"
      ;;

    (*)
      echo "unknown mode: $mode"
      exit 1
      ;;
  esac

}

echo ""
echo "-> option" $option
case $option in

  (submit)
    run_command submit "$@"
    ;;

  (start)
    run_command class "$@"
    ;;

  (stop)

    if [ -f $pid ]; then
      TARGET_ID="$(cat "$pid")"
      if [[ $(ps -p "$TARGET_ID" -o comm=) =~ "java" ]]; then
        echo "stopping $command"
        kill "$TARGET_ID" && rm -f "$pid"
      else
        echo "no $command to stop"
      fi
    else
      echo "no $command to stop"
    fi
    ;;

  (status)

    if [ -f $pid ]; then
      TARGET_ID="$(cat "$pid")"
      if [[ $(ps -p "$TARGET_ID" -o comm=) =~ "java" ]]; then
        echo $command is running.
        exit 0
      else
        echo $pid file is present but $command not running
        exit 1
      fi
    else
      echo $command not running.
      exit 2
    fi
    ;;

  (*)
    echo $usage
    exit 1
    ;;

esac

This script does the following:

  • Sets some environment variables
    • SPARK_LOG_DIR
    • SPARK_NICENESS
  • Defines shell functions
    • execute_command: actually runs the given command, using nohup to detach it
    • run_command: pre-processes the arguments passed to this script and then calls execute_command to run the command
  • Examines the arguments passed to this script and performs different actions
    • submit: generally used to submit a job
    • start: generally used to start a daemon component
    • ... the other options are skipped for now

Reading this script makes it clear that it ultimately relies on the shell command nohup to push the process into the background, which is how the daemon effect is achieved; a minimal sketch of the pattern follows.
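
The sketch below is generic bash, not Spark code: start a command with nohup in the background, capture its PID with $!, and record it in a pid file, just as execute_command does.

# Generic daemonization sketch: nohup + background job + capture the PID with $!.
log=/tmp/demo-daemon.out
pid=/tmp/demo-daemon.pid
nohup sleep 60 >> "$log" 2>&1 < /dev/null &
echo "$!" > "$pid"
echo "started as PID $(cat "$pid"), logging to $log"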

The rest of this script, such as log rotation, pid file bookkeeping, and process status checks, is not discussed for now. When reading source code it pays to focus on the essentials and aim at a clear target; you cannot swallow everything in one go.

Starting the Master process: back to start-master.sh

When reading source code, I like to get it running under my own control, ideally leaving the scripts behind and going straight to the Java/Scala code. That not only makes every step of the execution easy to trace, it also leaves a much deeper impression. So let's follow the startup of the Master process and check, step by step, which scripts get called and which commands get executed.

1. Run sbin/start-master.sh

(screenshot: running sbin/start-master.sh)

By running the script with the extra output added, we can see the command that start-master.sh ultimately executes:

(screenshot: the final command that gets invoked)

(screenshot: spark-daemon.sh being called)

spark-daemon.sh

From the information printed during the script's execution, it is clear that nohup is what ultimately gets called.

(screenshot: the script execution output)

Note: when modifying the source code yourself, never insert anything between

      nohup -- "$@" >> $log 2>&1 < /dev/null &
      newpid="$!"

because $! expands to the PID of the most recently started background process.
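
A small illustration (generic bash, not Spark code) of why the order matters: $! always refers to the most recent background job, so an extra background command started in between is the one whose PID ends up being recorded.

# $! tracks the most recently started background job, so an extra background command
# inserted between the daemon launch and the capture would hijack the pid file.
sleep 30 &                # the "daemon" whose PID we actually want
echo "debug output" &     # an extra background job mistakenly added in between
newpid="$!"               # now holds the PID of the echo job, not of sleep
echo "captured PID: $newpid"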

This script ultimately runs the following command:

/Users/didi/Develop/spark-2.3.3-bin-hadoop2.6/bin/spark-class org.apache.spark.deploy.master.Master --host localhost --port 7077 --webui-port 8080 >> /Users/didi/Develop/spark-2.3.3-bin-hadoop2.6/logs/spark-didi-org.apache.spark.deploy.master.Master-1-localhost.out 2>&1 < /dev/null &
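
Incidentally, spark-daemon.sh also honors SPARK_NO_DAEMONIZE (see the header comment in the script above), so for quick experiments the same Master can be run in the foreground, without the nohup and pid-file machinery:

# Run the Master in the foreground; Ctrl-C stops it and no pid file is written.
SPARK_NO_DAEMONIZE=1 "${SPARK_HOME}/sbin/start-master.sh"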

The bin/spark-class script

If you start or inspect the other scripts, you will find that they all ultimately call this one; it acts as a relay (a command translator, if you like) that produces the final executable command.
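
Since spark-daemon.sh already exports SPARK_PRINT_LAUNCH_COMMAND="1", a convenient way to see what spark-class produces is to call it directly: with that variable set, the launcher prints the full java command before anything is executed. Using --help keeps the Master from actually starting (this is exactly the call start-master.sh makes for its own --help handling):

# Print the java command that spark-class builds for the Master class; --help makes the
# Master print its usage and exit instead of starting a server.
SPARK_PRINT_LAUNCH_COMMAND=1 "${SPARK_HOME}/bin/spark-class" org.apache.spark.deploy.master.Master --help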

Let's look at the script's contents; the analysis is in the comments added below.

#!/usr/bin/env bash
# Mark that the script has started executing
echo "=============enter spark-class"
if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

. "${SPARK_HOME}"/bin/load-spark-env.sh

# Locate the java binary; from here on, code is executed in the form "java <class>"
# Find the java binary
if [ -n "${JAVA_HOME}" ]; then
  RUNNER="${JAVA_HOME}/bin/java"
else
  if [ "$(command -v java)" ]; then
    RUNNER="java"
  else
    echo "JAVA_HOME is not set" >&2
    exit 1
  fi
fi

# Locate the Spark jars. This step matters a lot: many ClassNotFound exceptions are caused by the jars path being missing or set incorrectly.
# Find Spark jars.
if [ -d "${SPARK_HOME}/jars" ]; then
  SPARK_JARS_DIR="${SPARK_HOME}/jars"
else
  SPARK_JARS_DIR="${SPARK_HOME}/assembly/target/scala-$SPARK_SCALA_VERSION/jars"
fi

if [ ! -d "$SPARK_JARS_DIR" ] && [ -z "$SPARK_TESTING$SPARK_SQL_TESTING" ]; then
  echo "Failed to find Spark jars directory ($SPARK_JARS_DIR)." 1>&2
  echo "You need to build Spark with the target \"package\" before running this program." 1>&2
  exit 1
else
  LAUNCH_CLASSPATH="$SPARK_JARS_DIR/*"
fi

# Add the launcher build dir to the classpath if requested.
if [ -n "$SPARK_PREPEND_CLASSES" ]; then
  LAUNCH_CLASSPATH="${SPARK_HOME}/launcher/target/scala-$SPARK_SCALA_VERSION/classes:$LAUNCH_CLASSPATH"
fi

# For tests
if [[ -n "$SPARK_TESTING" ]]; then
  unset YARN_CONF_DIR
  unset HADOOP_CONF_DIR
fi

# The launcher library will print arguments separated by a NULL character, to allow arguments with
# characters that would be otherwise interpreted by the shell. Read that in a while loop, populating
# an array that will be used to exec the final command.
#
# The exit code of the launcher is appended to the output, so the parent shell removes it from the
# command array and checks the value to see if the launcher succeeded.
# Debug addition: record and print the launcher command. Note that build_command runs in a
# subshell (via the process substitution below), so variables set inside it are not visible
# afterwards; print from inside the function, to stderr, so the NUL-separated stdout stays clean.
build_command() {
  tempMessage=("$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
  echo "launcher command: ${tempMessage[*]}" >&2
  "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  printf "%d\0" $?
}

# The following code is the key part:

# Turn off posix mode since it does not allow process substitution
set +o posix
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(build_command "$@")
# The meaning of `done < <(build_command "$@")` above is:
# 1. Run the build_command function.
# 2. Feed its output into the while loop through process substitution, i.e. the `< <( ... )` syntax.
# 3. read consumes that output one NUL-separated field at a time, and the loop appends each field to the CMD array.


echo "8888888"
echo "${CMD[@]}"
echo "8888888"
COUNT=${#CMD[@]}
LAST=$((COUNT - 1))
LAUNCHER_EXIT_CODE=${CMD[$LAST]}

# Certain JVM failures result in errors being printed to stdout (instead of stderr), which causes
# the code that parses the output of the launcher to get confused. In those cases, check if the
# exit code is an integer, and if it's not, handle it as a special error case.
if ! [[ $LAUNCHER_EXIT_CODE =~ ^[0-9]+$ ]]; then
  echo "${CMD[@]}" | head -n-1 1>&2
  exit 1
fi

if [ $LAUNCHER_EXIT_CODE != 0 ]; then
  exit $LAUNCHER_EXIT_CODE
fi
# After the following line, CMD contains only the command we need, without build_command's exit code.
CMD=("${CMD[@]:0:$LAST}")
echo "在设置环境变量后,可以手动执行"
echo "最终执行的CMD"
echo "${CMD[@]}"
echo "---------"
### Finally, exec runs the assembled command and starts the Master process.
exec "${CMD[@]}"

After the spark-class script has done its work, we finally have the command to execute:

/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/bin/java -cp /Users/didi/Develop/spark-2.3.3-bin-hadoop2.6/conf/:/Users/didi/Develop/spark-2.3.3-bin-hadoop2.6/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host localhost --port 7077 --webui-port 8080

Simplified:

java -cp xxxxx:xxx:xxxxx -Xmx1g org.apache.spark.deploy.master.Master --host localhost --port 7077 --webui-port 8080

In other words, java is used to launch the Master class and run its main method.

Note that at this point the command is still being run from the shell. Even though it takes the form java <class name>, the process still comes from the shell; exec simply replaces the current process image with the java process while keeping the same PID.
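
A quick way to see this for yourself (generic shell, unrelated to Spark):

# exec replaces the current process image but keeps the PID: note the PID printed here, then
# run `ps -p <that PID>` from another terminal while the sleep is running; it shows "sleep"
# under the very same PID.
bash -c 'echo "PID before exec: $$"; exec sleep 30'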

At this point the scripts have done everything they are supposed to do: setting environment variables, handling logs, tracking the process PID, and so on. From here on we can leave the scripts behind entirely and configure a run configuration in IDEA that simulates the Master startup with exactly the same effect.

Fully simulating the Master startup in IDEA

At the end of the spark-class script from the previous step, you can print all the environment variables and save them to a file, then reproduce them in the IDEA run configuration.
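
For example, a single line (my own illustrative addition, not part of Spark) placed just before exec "${CMD[@]}" in spark-class is enough to capture everything:

# Hypothetical debug line added just before `exec "${CMD[@]}"` in spark-class:
# dump the fully prepared environment to a file for copying into IDEA's run configuration.
env | sort > "${SPARK_HOME}/logs/spark-class-env.txt"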

My configuration is as follows:

(screenshot: the Run/Debug Application configuration)

Configuration details:

VM options

(screenshot: VM options)

Environment Variables

(screenshot: environment variables)

Here, the LAUNCH_CLASSPATH variable holds the absolute paths of all the jars under SPARK_HOME/jars, so it is very long.
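
If you want the expanded list rather than the jars/* wildcard, something along these lines (assuming the standard binary-distribution layout used throughout this post) produces a colon-separated classpath that can be pasted into the IDE:

# Join every jar under SPARK_HOME/jars into one colon-separated classpath string.
printf '%s:' "${SPARK_HOME}"/jars/*.jar | sed 's/:$//'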

Once the configuration is done, you can set aside sbin/start-master.sh and use IDEA's Run to launch the Master process with one click, and you can also debug it and step through the execution.

Summary

Starting the Master involves the following main steps and scripts:

  • The user runs start-master.sh
  • which calls spark-daemon.sh
  • which calls spark-class
    • spark-class runs org.apache.spark.launcher.Main to obtain the final command CMD
    • and then runs that final CMD
  • The final CMD is a plain java command:
/xxxx/java -cp /Users/didi/Develop/spark-2.3.3-bin-hadoop2.6/conf/:/Users/didi/Develop/spark-2.3.3-bin-hadoop2.6/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host localhost --port 7077 --webui-port 8080

Having analyzed the flow and the role of the startup scripts, we know that what ultimately gets launched is the org.apache.spark.deploy.master.Master class, so that class must have a main method; checking in IDEA confirms that it indeed does.

The next post will analyze Master's main() and how the environment of the all-important Master node is built up step by step, and, in line with one of the original goals of this series (understanding Spark's RPC framework), start to shed light on RPC.