Spark Source Code Walkthrough: Environment Preparation (Part 1)


1. Why read the source code

Reading source code can strengthen the body and quicken the step, tempering muscle and bone without while cultivating the inner energies of a hundred schools within. At rest it repels enemies from a hundred li away; in motion it plucks an opponent's head from amid ten thousand flowers. Source code is truly the essential killer technique for roaming the jianghu and taking on the world. Through source code we can ride the sword to immortality, leap beyond the Three Realms, stand outside the Five Elements, and easily attain, at the tender age of 18, the immortal bearing of a Taoist sage with hair and beard gone fully white...

Alright, I made all of that up. Reading source code will not grant immortality, but it will let us hold our own against interviewers (a delightful feeling), calmly tune away the various little monsters of performance problems, and see with our own eyes how the low-level implementation is actually written. That is an enormous help for raising our coding level and for any secondary development we do later.

2. How to read the source code

Reading source code is best done with some method and discipline; the worst approach is to wander through it aimlessly. Always read with a concrete goal: understand one functional module at a time, for example the task submission flow, the master startup flow, the worker startup flow, the executor startup process, or the creation of the core SparkContext object. Conquer one small module per session, and all the modules stitched together form the complete source. Ideally, an experienced guide first sketches the overall outline for you, and you then verify each step in detail yourself.

1. Download the source and import it into IDEA

This approach is generally used for a deep read of the framework's flow. We can add our own comments to the code, which makes it easy to review after we forget things and to do in-depth analysis (this is usually a long process, so the next session can pick up exactly where the previous one left off).

Source download: archive.apache.org/dist/spark/…

1. After downloading the source, extract it to a path containing no Chinese characters and no spaces.

2. Import the extracted source into IDEA (as a Maven project).

2. Pull in the Spark source jars via Maven

This method is for everyday development, when you need to check how an operator is used or implemented. With the XML below, IDEA will automatically download the packages we need, and we can read them directly. For example, we create a Maven project, import the Spark dependencies, write our code against them, and attach the sources.

Add the following to pom.xml:

 <dependencies>
        <!-- scala -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.12.10</version>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-compiler</artifactId>
            <version>2.12.10</version>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-reflect</artifactId>
            <version>2.12.10</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>3.0.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-yarn_2.12</artifactId>
            <version>3.0.0</version>
        </dependency>


        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>3.0.0</version>
        </dependency>

        <!-- Spark Streaming -->
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.12</artifactId>
            <version>3.0.0</version>
            <scope>provided</scope>
        </dependency>

        <!-- MySQL JDBC driver -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.47</version>
        </dependency>

    </dependencies>
    <build>
        <plugins>
            <!-- When a Maven project mixes Java and Scala code, the maven-scala-plugin compiles and packages both together -->
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <!-- JDK version used for Maven compilation -->
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.3</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.0.0</version>
                <executions>
                    <execution>
                        <phase>package</phase><!-- bind to the package lifecycle phase -->
                        <goals>
                            <goal>single</goal><!-- run only once -->
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <!--<finalName></finalName>-->
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.10</version>
                <configuration>
                    <skip>true</skip>
                </configuration>
            </plugin>
        </plugins>
    </build>

A word-count program developed with Spark:

Data file word.txt:

hello,spark
hello,scala,hadoop
hello,hdfs
hello,spark,hadoop
hello

Implementation:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // local[*]: run Spark inside this JVM, one worker thread per CPU core
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val tuples: Array[(String, Int)] = sc.textFile("hdfs://node01:8020/word.txt")
      .flatMap(_.split(","))  // split each line on commas into words
      .map((_, 1))            // pair each word with an initial count of 1
      .reduceByKey(_ + _)     // sum the counts per word
      .collect()              // bring the results back to the driver
    tuples.foreach(println)
    sc.stop()                 // release the SparkContext
  }
}
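If HDFS is not up yet, the exact same job can read from the local filesystem; only the input path changes. A minimal sketch, where the file:// path is hypothetical and should point at wherever your word.txt actually lives:

import org.apache.spark.{SparkConf, SparkContext}

object WordCountLocal {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCountLocal").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // the file:// scheme forces the local filesystem even when HADOOP_CONF_DIR is set
    sc.textFile("file:///home/hadoop/word.txt") // hypothetical path, adjust as needed
      .flatMap(_.split(","))
      .map((_, 1))
      .reduceByKey(_ + _)
      .collect()
      .foreach(println)
    sc.stop()
  }
}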

3. Setting up IDEA shortcuts

Keyboard shortcuts can be configured in IDEA; see:

blog.csdn.net/u012453843/…

Commonly used IDEA shortcuts:

Ctrl+Z: Undo

Ctrl+Shift+Z: Redo

Ctrl+X: Cut

Ctrl+C: Copy

Ctrl+V: Paste

Ctrl+Y: Delete the current line

Ctrl+D: Duplicate the current line

Ctrl+Shift+J: Join the selected lines into one

Ctrl+N: Find a class

Ctrl+Shift+N: Find a file

Ctrl+G: Go to a line in the file

Alt+Left: Go back to the previous cursor position

Alt+Right: Go forward to the next cursor position

Ctrl+Shift+Backspace: Go to the last edit location

Ctrl+Shift+Backslash: Go to the next edit location

Ctrl+B: Go to the declaration of the variable

Ctrl+Alt+B: Go to the implementation of the selected class or method

Ctrl+Shift+B: Go to the type definition of the variable under the cursor

Ctrl+U: Go to the definition of the method this one overrides or implements

Ctrl+F12: Show the structure of the current file

Ctrl+Alt+F12: Show the path of the current file, with quick access to its parent directories

Ctrl+H: Show the inheritance hierarchy of the current class

Ctrl+Shift+H: Show the hierarchy of the current method

Ctrl+Alt+H: Show the call hierarchy of the current method

F2: Go to the next error

Shift+F2: Go to the previous error

Ctrl+Alt+Up: Find the previous occurrence of the variable

Ctrl+Alt+Down: Find the next occurrence of the variable

Ctrl+=: Expand code

Ctrl+-: Collapse code

Ctrl+Alt+=: Expand code recursively

Ctrl+Alt+-: Collapse code recursively

Ctrl+Shift+=: Expand all code

Ctrl+Shift+-: Collapse all code

Ctrl+Shift+Down: Move the code block under the cursor down

Ctrl+Shift+Up: Move the code block under the cursor up

Ctrl+Alt+Shift+Left: Move the element left

Ctrl+Alt+Shift+Right: Move the element right

Alt+Shift+Down: Move the line down

Alt+Shift+Up: Move the line up

Ctrl+F: Find in the current file

Ctrl+R: Replace in the current file

Ctrl+Shift+F: Find a string across all files

Ctrl+Shift+R: Replace a string across all files

Alt+F7: Find usages of the current variable, shown in a list

Ctrl+Alt+F7: Find usages of the current variable, shown in a popup

Ctrl+F7: Find usages of a symbol within the file

Ctrl+Shift+F7: Highlight usages of the variable in the file

Ctrl+O: Override methods of the base class

Ctrl+I: Implement methods of a base class or interface

Alt+Insert: Generate constructors, getters/setters, etc.

Ctrl+Alt+T: Surround the selected code with if, while, try/catch, etc.

Ctrl+Shift+Delete: Remove the surrounding wrapper code

Alt+/: Code completion

Alt+Enter: Show intention actions (quick fixes, e.g. declare a thrown exception)

Ctrl+J: Insert a Live Template to quickly add one or more lines of code

Ctrl+Alt+J: Surround with a Live Template

Ctrl+/: Comment with //

Ctrl+Shift+/: Comment with /* */

Ctrl+Alt+L: Reformat code

Ctrl+Alt+I: Auto-indent lines

Ctrl+Alt+O: Optimize imports

Ctrl+]: Jump to the end of the enclosing {} block

Ctrl+[: Jump to the start of the enclosing {} block

Ctrl+Shift+Enter: Complete the statement, adding {} or ; to if, for, functions, etc.

Shift+Enter: Start a new line below the current one

Ctrl+Alt+Enter: Insert a new line above the current one

Ctrl+Delete: Delete from the cursor to the end of the word

Ctrl+Backspace: Delete from the cursor to the start of the word

Ctrl+Left: Move the cursor to the previous word

Ctrl+Right: Move the cursor to the next word

Ctrl+Up: Scroll up one line

Ctrl+Down: Scroll down one line

Ctrl+W: Select the word under the cursor (extend selection)

Ctrl+Shift+U: Toggle case

Shift+F6: Rename

Ctrl+F6: Change the method signature

Ctrl+Shift+F6: Change the type

3. Getting our feet wet: the Spark task submission source

1. Installing the Spark 3.0 cluster environment

In real-world work, Spark tasks are submitted to a YARN cluster, so in principle installing Spark only requires setting up a submission client (below we still configure and start a small standalone cluster, which the standalone-mode examples later use).

Step 1: Download and extract the package

On node01, download the Spark 3.0 package:

cd /kkb/soft
wget http://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
tar -zxf spark-3.0.0-bin-hadoop3.2.tgz -C /kkb/install/

Step 2: Edit the configuration files

On node01, run the following to edit the spark-env.sh configuration file:

cd /kkb/install/spark-3.0.0-bin-hadoop3.2/conf/
cp spark-env.sh.template spark-env.sh

vim spark-env.sh

export JAVA_HOME=/kkb/install/jdk1.8.0_141
export HADOOP_HOME=/kkb/install/hadoop-3.1.4
export HADOOP_CONF_DIR=/kkb/install/hadoop-3.1.4/etc/hadoop
export SPARK_CONF_DIR=/kkb/install/spark-3.0.0-bin-hadoop3.2/conf
export YARN_CONF_DIR=/kkb/install/hadoop-3.1.4/etc/hadoop

On node01, run the following to edit the slaves configuration file:

cd /kkb/install/spark-3.0.0-bin-hadoop3.2/conf/
cp slaves.template slaves

vim slaves

# set the file contents to the following
node01
node02
node03

On node01, run the following to edit the spark-defaults.conf configuration options:

cd /kkb/install/spark-3.0.0-bin-hadoop3.2/conf
cp spark-defaults.conf.template spark-defaults.conf
vim spark-defaults.conf


spark.eventLog.enabled  true
spark.eventLog.dir       hdfs://node01:8020/spark_log
spark.eventLog.compress true
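Note: the settings above only make applications write event logs. For the history server started in step 4 to find and display those logs, it also needs to be pointed at the log directory; a minimal sketch, assuming the same HDFS path as above, is one extra line in spark-defaults.conf:

spark.history.fs.logDirectory    hdfs://node01:8020/spark_log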

Step 3: Distribute the package

Copy the Spark installation from node01 to the other machines.

On node01, run the following to distribute it:

cd /kkb/install/
scp -r spark-3.0.0-bin-hadoop3.2/ node02:$PWD
scp -r spark-3.0.0-bin-hadoop3.2/ node03:$PWD

Step 4: Start the Spark cluster

On node01, run the following to start the Spark cluster:

hdfs  dfs -mkdir -p /spark_log
cd /kkb/install/spark-3.0.0-bin-hadoop3.2
sbin/start-all.sh 
sbin/start-history-server.sh
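To confirm the daemons came up, a quick check with the JDK's jps tool on each node should show the standalone processes (the names below are how Spark's daemons report themselves):

jps
# expected on node01: Master, Worker, HistoryServer
# expected on node02/node03: Worker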

Step 5: Open the web management UIs

In a browser, visit:

http://node01:8080 to view the Spark cluster management web UI. Note: if port 8080 cannot be reached, try 8081 instead; if 8081 also fails, keep trying successive port numbers (when a port is taken, the UI binds to the next one).

http://node01:18080/ to view the Spark history server.

2. The task submission process: running Spark to compute Pi

Once the Spark cluster is installed and running, we can submit jobs, for example the bundled SparkPi example that estimates Pi. Spark tasks can be submitted in several modes: local mode, standalone mode, YARN mode, and so on. In real work YARN mode is by far the most common. The submission modes are introduced below.

2.1 Local mode

Local mode does not require starting any Spark processes; extracting the Spark package is enough. To submit in local mode with client deploy mode:

bin/spark-submit --class org.apache.spark.examples.SparkPi --master local --deploy-mode client --executor-memory 2G --total-executor-cores 4 examples/jars/spark-examples_2.12-3.0.0.jar 10

Submitting in local mode with cluster deploy mode:

bin/spark-submit --class org.apache.spark.examples.SparkPi --master local --deploy-mode cluster --executor-memory 2G --total-executor-cores 4 examples/jars/spark-examples_2.12-3.0.0.jar 10

We will see that cluster deploy mode on a local master fails immediately.
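The rejection comes from SparkSubmit itself: while preparing the submit environment it validates the master/deploy-mode pair and aborts with "Cluster deploy mode is not compatible with master \"local\"" before launching anything. Below is a minimal self-contained sketch of that validation logic; the object and method names here are illustrative, not Spark's exact ones:

// Illustrative sketch of the master/deploy-mode validation that
// SparkSubmit performs; names are simplified for illustration.
object DeployModeCheck {
  def validate(master: String, deployMode: String): Unit =
    (master, deployMode) match {
      case (m, "cluster") if m.startsWith("local") =>
        // the rejection we just hit on the command line
        sys.error("Cluster deploy mode is not compatible with master \"local\"")
      case _ => // other combinations proceed to environment preparation
    }

  def main(args: Array[String]): Unit = {
    validate("local", "client")  // fine
    validate("local", "cluster") // throws, mirroring spark-submit's error
  }
}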

2.2 Standalone mode

Standalone submission requires the Spark cluster we set up above, with the master and worker processes running.

Client deploy mode submit command:

bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master spark://node01:7077 \
--deploy-mode client \
--executor-memory 2G \
--total-executor-cores 4 \
examples/jars/spark-examples_2.12-3.0.0.jar 10 

Cluster deploy mode submit command:

bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master spark://node01:7077 \
--deploy-mode cluster \
--executor-memory 2G \
--total-executor-cores 4 \
examples/jars/spark-examples_2.12-3.0.0.jar 10 


2.3 YARN mode

This mode submits the task to the YARN cluster, which is what we use most in real work.

On node01, run the following to submit the task to the YARN cluster (client deploy mode):

bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode client \
examples/jars/spark-examples_2.12-3.0.0.jar 50

其中spark on yarn cluster模式代码提交运行架构如下 7V1bc6M4Fv41VO0+xIUkEOIRO8nsw2Sra3qrdudpCxvZ8Qw2HozTyf76lQBhkIRNbMkh3e2emmBxl75zzncukh0027z+kse756csoakD3eTVQfcOhBAHhP3hLW9Vy50fBFXLKl8nVRs4Nnxd/4/WjW7delgndN85sMiytFjvuo2LbLuli6LTFud59q172DJLu3fdxSuqNHxdxKna+u91UjxXrQQGx/Z/0PXqWdwZ4LDas4nFwfWb7J/jJPvWakIPDprlWVZUW5vXGU1574l+qc577NnbPFhOt8WQE+530+Qx9A+//gVimn/NoyWZ3dVXeYnTQ/3C0W6Xrhdxsc62T/G+oLnzQJzpoxNOnYfAiYhD2IbvhK4TPfINEvHG6g2LN9FteXbYJpTfGTho+u15XdCvu3jB935jSGFtz8UmrXen8Zym03jx56o8bZalWc52bbMtO366L/Lsz6bzWbdNl9m2qJECMPsep+vVln1J6bLgu9dpKq7hQLRcLtinuVBrD8IoROwZpy80L9g7p1F9oSLjT6h2sOgtdjh9bTXVHf4LzTa0yN/YIfXeO+TXcKjxD12vhsO3I5pQWLc9t5Hk+zWKawSvmqsfB5lt1OP8jjGHypg7EKdF3bGdccR/HTKx425fdnnEDoBg93rcybZW5V8VFg+eM40cgssNBh233oiiP+KXmAMqZLs9DcTqJ5rn4uoMPvmfXw/zzboQO9nbV09c3/8qEJ4GURJTstSCCC8InS8NwQVgAY8aLyAMVbwEHlHxgrAtvCANXqSepglTmvXXLC+es1W2jdOHY+v0OBYu+3Y85teMy1o5An/Qonir5To+FFl3fHq7d58d8gU98fxebTbifEWL83LB3+XkYOU0ZerxpWsgdL1en/olW5di1QwyxFgaZOJ2r1I9a32iNHzNk1w+ol6vBtjv4u0gDeDqNMCey2gp4lP+X60GgPOAHfLIdUOlD6JS4JnYM+Fnu8KZE81UgWd/4w0X0u18z//M1+zJHstb3O1rPeA6/qylDaqnb05vLnjp+5i8xnU9K3fGD/HS9JW9c8rYH3z8I873zeiL9v/CCWAmduKyf+wAdh3f7e0YSWMxjVx0FYyi2QmcI4w1tsCnJPFUkyGziM06SUrdp7M8XX1oxHb4rqRWAle1HR7RcA3ouv1a7Crb4V/NNUCgg0bNBpgSiPLVYcO6bV+qm0eHwFKn+M7U5VomYoQClS0PXAHptIzllhuylCVZUD3VnRPf840hDRKZ1WqQBn0dq/VsIQ2fZymLQ/7SdDTdJhH3D9nXRRrv9+tFt6fp67r4T2v7dy6qE7/+dv9aS2755U182bI3aU7iX1pn8a/H08pv4rzLuU0wkNuQgdymNX664RNtV1MgF3YRBEIsAaN6dYUCqdcKA/lSYBibYsMfv7UO2/ED9ieemoTynZB75tmwdAbpOulso3oKo/QuuF7phjqle3e3qSICAzXbORtbO/htJVY3CTW3YALB7vfxthWGssaDqsYDQCMy1hQeuX6UyZlRliin+xbnW01zl4T/KJDwXUm2A4JVSJzSosYhEVoT/NJE/oCDDANZ7rEq9zpKbU3shY6xMMgJ3aXZ290mS+iPONSyigcaUtvw19uMtRqetzPWskZfpIceE/CD6nrQw+M6Ds9tsQGtK3t57LN8NYlZXz/TSRmImdDXeFFGYialP/5lrQOM3PQTQCVZ8DWxGR2AAmsAGhDXf5/HLLzfxsG17f3WfXrW+xWyMhr3F2GZVsgZgMHuL0BAvhay5P9qbkXOOMDAlz3g8BYeMNBlOIxg+2aRHcHzzoMbjAzcAMpECl4Mblfh37I+NAZu5VZALqhQojtKEMu/Bbb7Y+pD0y0A6/P3ZZScp+P9MoMXncjf82OiR2calrl74ERAYQcVxZzwoMHkd/a/WUUqWwUmNwm+n0wDvSv43q1EOR+OB/MYUKjQFrbHdfED671qYDq1KvxjinQg4kmIDgaSDiBHY82xjgFx+isGBaqDssrjZE073Twv/6kDtvT5P92A4fJjaGAw6dEb3WBioFPS9uigrajxb7RS90/xNl4NDx4bAkHjAWhgsFwmoS7By0QQ6tNoCZ5jH2v8CXMVY1imhlATcdJiw57MGgg1+zpo/DNLRgiLgMaYuiosBtSAWSwklB3IpmaoXRimS0BA2R0whwtdvFkaL6tp1ps5mk3p8VkyHoyLjBOZiyulHoO5eKDwYzmPYblqTXgRJuDmTsKQdCCHPXQadOzLF5qv2VtwxWEHiOdrFMfm7LkET1wSNp9Aghvb7XmXIk4xheGNEdcff9/ldLDdKw3elRVyvRcCzgPijhnRllkqpZin6wZZc/li3dbRvGtV9tnjQBp4wIG9YtVdkUjH3KWIYq0PSYlLuA47PX2iPVnC11CUJhJugqUQQUqOHqevkld8S/IqjIT1MDe0SD/QUPPgj8s8jIF+9EcCjZsLNaMCW9o55Bu85L2c7EKCagqM74TAmd7XqjryTsyiUadmyUAeUNXcqI0+RWMydzZcNUFDGkjON0BXM4EGoJtqoAHJiOG81SfCOarUjgsEjx3OWy/XRGBowamQhZ+ayIKeUTMAP2dtmgy2qFVfmnIPrA2c25u1qQuc/wzCfWgQ7uiQtoNwri4IJ9ekm8NFf9y+z29bVJ3GIZGv5n9jeoA9HLu729r6O98sC3hKEC3jzTp9q85pKnq4MHpOOXFznsfrLSNV7ibbZsoBzWX53n2JlM7VWxBFE39XqCDl+cPfnmYpR8llzuonfGk1x1qqblJqdZGiteCjGtbsvfxRTbssS2bar/KNKHeMJ76k3rHGSwWBOOw2LLE/yTIc0jWeazCfQXJ9lV4gN2jsgliLYNgTWbnPGWfLOQdhUGV85BQZGReQtYZKb1okHFOQ+DTQ4TjEAYpNZZEBA+4kkJAMNctLAIKa4zqVhXBijaz0z0QYHAT0eiCFGpd6EOnFTuSWKhQ75N6JHspFJ7zSrcZcl06D4zIUJiZV9z63sQ4QBbqa+a5sg72jr0rXff6yruSwET8eUmAHMPsyc8isbAmdKRYm5r60PkF9MDsgxGVZz4w9JY8yqLbJRsmNIoHvD4+2hdozJHsk6MY5ARk6AVau8zS3SseA7Ni74py9/XTe96+lfyy+PwBK9fXl5bYBD8Gg5uMiiUoQSaeaqk8MPPkdupNP7RQbIl0GzAqqzgfBh1Z7w5FNdvbUiYKXok9eTqoBgXm4IUnDiQnaJ7K1Uu5HKg63BFDYyzQGh0X6yKvXSa0y4sFJg8/LYcksemqZzpJRcGJCSpsYlDbRLytpMbezx4UnjEZWztLV9xtLE8bRk9c8gxrjqCtg86ytYKWb6vJpfCvQA8+vQ+nYKPwpLZsbGhpIGL8KkA65ke/yJfUsV1JDV+NqaVPbelfLWgkeMj3RpT1k4XX2emiqSKjwsdhrJDGt47oO77bX6rIRcvzIlL1WUhZipnI/l1Cm6eBb2Gvd7JVxAFbg8NMVeXryckzo4glXHpJtt635Vp4sZTC
AZ55Nfk3xbHYBq+bd/BYvrAJOVdXGlE+fUqlfFaeJHluRpyrEUvJEZqmvzcSOlAcSX3IIkKexmFBHBK2Zy+8rWQZ1+dx/Pec0Tr5kWTo2Cnh5SH0eL0ii5X8MVJ5vKDXkBxMoZ4Y0syyAd/N4OlKTQw+vdHEosvy3w3Ybz9lY3lSFLOMkiPUrDXrYdw3FWXE4AVJ8DUCdDgETEqgDIltBc8OhpjfKCa5ZnO/pL1y6aSLGh8sK4zAiWkBEu2gINZNXB5cK9sb7j3dRV9b+KPEfe9FPL4tqF/3oKgftzZYVhrAFNNwiH7ca9XESDN9TfXKkUdc6hmHNIfdMB8zlvm9VgjrHOlBRI6qvAe3t7PM+0VAnXiB1JD4R9IGEjEvXF0WufCVLy4sCIt/IvYF748Gb4TVE4WDEmq1aFuA8H4oaV+YSIlMohgq4bKFYeeTwFij+1AF+fLp4amRe3PvmiPStnd+/2r4RLidHQ5HGj9P96o61pVk9XaD+80YetAsICfr5sH35/mp0e5ZMajw1jzNuXqEbcOeMcVxOvaecnX9P8nur2kZ5FpdOfrXE3p4A9y+79RnhjE79bMvJCd/jeMTeKkqp093mhq2tv7/vnvc04WWmNHmi+328ouVvVhmfSP/j9asm8nVJr45FsaJPoFjlnK9Wr+rm59vTq/3z2katV/vAPLjqTCsS6+08e+0tFhsX3q+q6BmwxpoRIiHh3QMq3gN9Osda9sBTk5DBiUrDJqgrD/TYJvvbKLQPQinC62kmuOgcOXsBXt08LVMBM95t162IxqHcWpakWbDE+Jok3tA1SbyxVcvINtC7eCUApIQZrl6TpO+hhd4Sd/JvEhw2sMBfL8JGggfg8bXgjh8pfAlkRTIUHCA8eV1lwXfLa0aI3moNJWmMDm3yhrLxqSfJ1VVMkRM+Xsc3PiRRCJUUDmzm9rZ5r27iPrJV0iESVMZ/sNCdkDBo2Y47vkYMPG0+jGZbRBLlrGUQmByJJmhOubqMEgYy4CxVUSo3Eo88PA90i0k6PrSG9c5iSEM5EnHhe0hS+c2cfAi4frbV3JAsH0oOcXBOXYatUkFjSz5uklP335+NHFEEcMSPeHSFv/t44MfOeALCyx1PnNA3/pM+RxsSdH1tAM942x9El8a1pJ5Sr4Tksv3BdKmvZsS4OZBvhLxbmAM1cxh2iji7KzodnTF1Suenc72UciCkqdAMsQpSa/E73/QPxV8u+WOLjMmrmkDv4h9lB658LWTtV9lhz1P3/24Xks+4iRrQzT0yY8CCsFNZyfx9790G7AN+4W7wBGE8LjnB8irnSLZXg+UEy1BUjKgpMVHvFJ4Rk97XtCsmpvMrjZhckk+RhCGwJw0C5OelYVw0ULUaiqYfLA2KKlfqnU1JQyjnU85KQ6hIw3U/9si+5hn3bo+HM8r2/JQllB/xfw==

Cluster deploy mode on YARN:

bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
examples/jars/spark-examples_2.12-3.0.0.jar 50

[Figure: Spark on YARN client-mode task submission flow]

3. Analyzing the task submission scripts

1. Analysis of the spark-submit script

Task submission goes through the spark-submit script, so let's take a look at what this script contains:

#!/usr/bin/env bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

# disable randomized hash for string in Python 3.3+
export PYTHONHASHSEED=0

exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"

Looking at it, spark-submit simply executes the spark-class script with org.apache.spark.deploy.SparkSubmit as an argument, and "$@" forwards all of the arguments we passed to spark-submit.

2. Analysis of the spark-class script

Now that we know spark-submit runs the spark-class script with the org.apache.spark.deploy.SparkSubmit class appended, let's look at the contents of spark-class:

#!/usr/bin/env bash

if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

. "${SPARK_HOME}"/bin/load-spark-env.sh

# Find the java binary
if [ -n "${JAVA_HOME}" ]; then
  RUNNER="${JAVA_HOME}/bin/java"
else
  if [ "$(command -v java)" ]; then
    RUNNER="java"
  else
    echo "JAVA_HOME is not set" >&2
    exit 1
  fi
fi

# Find Spark jars.
if [ -d "${SPARK_HOME}/jars" ]; then
  SPARK_JARS_DIR="${SPARK_HOME}/jars"
else
  SPARK_JARS_DIR="${SPARK_HOME}/assembly/target/scala-$SPARK_SCALA_VERSION/jars"
fi

if [ ! -d "$SPARK_JARS_DIR" ] && [ -z "$SPARK_TESTING$SPARK_SQL_TESTING" ]; then
  echo "Failed to find Spark jars directory ($SPARK_JARS_DIR)." 1>&2
  echo "You need to build Spark with the target \"package\" before running this program." 1>&2
  exit 1
else
  LAUNCH_CLASSPATH="$SPARK_JARS_DIR/*"
fi

# Add the launcher build dir to the classpath if requested.
if [ -n "$SPARK_PREPEND_CLASSES" ]; then
  LAUNCH_CLASSPATH="${SPARK_HOME}/launcher/target/scala-$SPARK_SCALA_VERSION/classes:$LAUNCH_CLASSPATH"
fi

# For tests
if [[ -n "$SPARK_TESTING" ]]; then
  unset YARN_CONF_DIR
  unset HADOOP_CONF_DIR
fi


build_command() {
  "$RUNNER" -Xmx128m $SPARK_LAUNCHER_OPTS -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  printf "%d\0" $?
}

# Turn off posix mode since it does not allow process substitution
set +o posix
CMD=()
DELIM=$'\n'
CMD_START_FLAG="false"
while IFS= read -d "$DELIM" -r ARG; do
  if [ "$CMD_START_FLAG" == "true" ]; then
    CMD+=("$ARG")
  else
    if [ "$ARG" == $'\0' ]; then
      # After NULL character is consumed, change the delimiter and consume command string.
      DELIM=''
      CMD_START_FLAG="true"
    elif [ "$ARG" != "" ]; then
      echo "$ARG"
    fi
  fi
done < <(build_command "$@")

COUNT=${#CMD[@]}
LAST=$((COUNT - 1))
LAUNCHER_EXIT_CODE=${CMD[$LAST]}


if ! [[ $LAUNCHER_EXIT_CODE =~ ^[0-9]+$ ]]; then
  echo "${CMD[@]}" | head -n-1 1>&2
  exit 1
fi

if [ $LAUNCHER_EXIT_CODE != 0 ]; then
  exit $LAUNCHER_EXIT_CODE
fi

CMD=("${CMD[@]:0:$LAST}")
exec "${CMD[@]}"

At the end, the script execs the command held in the CMD array with all of its arguments. So what exactly is CMD? The while loop above builds it: build_command runs org.apache.spark.launcher.Main, which constructs the real java command and prints it as NUL-delimited tokens; the loop collects those tokens into CMD, and the final element is the launcher's exit code (the printf "%d\0" $? in build_command). To see the command concretely, we can modify the script and print it out.

Edit the spark-class script and add a line that prints the command:

cd /kkb/install/spark-3.0.0-bin-hadoop3.2/
vim bin/spark-class
 
# add a line here, near the end, to print the command
CMD=("${CMD[@]:0:$LAST}")
echo "${CMD[@]}"
exec "${CMD[@]}"

Resubmit the task:

cd /kkb/install/spark-3.0.0-bin-hadoop3.2/
bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn  --deploy-mode client examples/jars/spark-examples_2.12-3.0.0.jar 50

The printed output is as follows:

[hadoop@node01 spark-3.0.0-bin-hadoop3.2]$  bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn  --deploy-mode client examples/jars/spark-examples_2.12-3.0.0.jar 50

/kkb/install/jdk1.8.0_141/bin/java -cp /kkb/install/spark-3.0.0-bin-hadoop3.2/conf/:/kkb/install/spark-3.0.0-bin-hadoop3.2/jars/*:/kkb/install/hadoop-3.1.4/etc/hadoop/ -Xmx1g org.apache.spark.deploy.SparkSubmit --master yarn --deploy-mode client --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.12-3.0.0.jar 50

Looking at the printed output, what gets executed is essentially one command:

java -cp /kkb/install/spark-3.0.0-bin-hadoop3.2/conf/:/kkb/install/spark-3.0.0-bin-hadoop3.2/jars/*:/kkb/install/hadoop-3.1.4/etc/hadoop/ -Xmx1g org.apache.spark.deploy.SparkSubmit --master yarn --deploy-mode client --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.12-3.0.0.jar 50

In other words, it simply runs a java command that performs the task submission through org.apache.spark.deploy.SparkSubmit: it starts a JVM process and executes SparkSubmit's main method. We can now open the source, find SparkSubmit's main method, and verify these startup steps.
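As a preview of where that reading starts, SparkSubmit's entry point in the Spark 3.0 sources looks roughly like this (abridged; the overrides that redirect logging output to the console are elided):

// org.apache.spark.deploy.SparkSubmit (Spark 3.0, abridged)
object SparkSubmit extends CommandLineUtils with Logging {

  override def main(args: Array[String]): Unit = {
    // an anonymous subclass that maps a failed user application
    // to the right process exit code
    val submit = new SparkSubmit() {
      override def doSubmit(args: Array[String]): Unit = {
        try {
          super.doSubmit(args)
        } catch {
          case e: SparkUserAppException => exitFn(e.exitCode)
        }
      }
    }
    submit.doSubmit(args)
  }
}

doSubmit parses the command-line arguments into SparkSubmitArguments and then dispatches on the requested action (submit, kill, request status, or print version); the actual submission continues in the submit() method, which is where the next part of this walkthrough picks up.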