Setting Up an Environment for Reading the Spark Source Code
- Tools to install
- Mac Pro
- Java
- Maven
- IDEA (Scala plugin + ANTLR plugin)
- Download Java (an Oracle account is required)
1. Download (after downloading, just run the installer and click through the defaults)
https://www.oracle.com/java/technologies/downloads/#java8
2. Locate JAVA_HOME
- echo $JAVA_HOME prints nothing, which means JAVA_HOME has not been configured.
- which java prints /usr/bin/java, but that is not the real Java home.
- As described in https://developer.apple.com/library/content/qa/qa1170/_index.html, find the real Java home:
- Run: /usr/libexec/java_home -V
- Output: /Library/Java/JavaVirtualMachines/jdk1.8.0_321.jdk/Contents/Home
The output above is the real JAVA_HOME. If it is not set, the Spark build will fail.
3. Add the environment variables (vim ~/.bash_profile)
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_321.jdk/Contents/Home
export PATH=.:${JAVA_HOME}/bin:${PATH}
4. source ~/.bash_profile
Run echo ${JAVA_HOME} to verify that the variable is now set.
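Beyond checking that the variable is non-empty, it is worth verifying that the path actually points at a full JDK: building Spark needs bin/javac, which a bare JRE (or the /usr/bin/java stub) does not provide. A minimal sketch, where check_java_home is a hypothetical helper written for this note, not part of any tool:

```shell
# Hypothetical helper: verify that a candidate JAVA_HOME is a full JDK,
# i.e. it contains an executable bin/javac (a bare JRE will not do).
check_java_home() {
  if [ -x "$1/bin/javac" ]; then
    echo "ok: $1"
  else
    echo "missing javac under $1" >&2
    return 1
  fi
}

# Check the value exported in ~/.bash_profile (falls back to /usr if unset,
# just so this demo never aborts a `set -e` shell).
check_java_home "${JAVA_HOME:-/usr}" || echo "fix JAVA_HOME before building"
```

If this prints "missing javac", go back and re-run /usr/libexec/java_home -V to find the correct JDK path.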
- Download Maven
1. Download
https://maven.apache.org/download.cgi
2. Add the environment variables (vim ~/.bash_profile)
export MAVEN_HOME=/Users/yanhaoqiang/soft/apache-maven-3.8.4
export PATH=.:${MAVEN_HOME}/bin:${PATH}
3. source ~/.bash_profile
- Download the Spark source code
git clone https://github.com/apache/spark.git
- Start the build
- Run the command:
mvn -DskipTests clean package
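One thing worth doing before the first run: the full build is memory-hungry, and the official "Building Spark" guide recommends raising the JVM limits first (the values below are the ones it suggests; adjust them to your machine):

```shell
# Set in the shell (or ~/.bash_profile) before running the mvn command above;
# without this the Scala compile can run out of heap or code-cache space.
export MAVEN_OPTS="-Xss64m -Xmx2g -XX:ReservedCodeCacheSize=1g"
```

Alternatively, the Spark repo ships a ./build/mvn wrapper script that bootstraps a suitable Maven version automatically.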
- Handling possible build errors
- Error 1: javac not found
Error message:
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:4.5.6:compile (scala-compile-first) on project spark-tags_2.12: wrap: java.io.IOException: Cannot run program "/Library/Internet Plug-Ins/JavaAppletPlugin.plugin/Contents/Home/bin/javac" (in directory "/Users/yanhaoqiang/code/spark"): error=2, No such file or directory -> [Help 1]
Fix:
See the Java installation section above. The error is caused by JAVA_HOME not being set; setting it resolves the problem.
- Confirming the result
Output like the following indicates the build succeeded:
[INFO] Reactor Summary for Spark Project Parent POM 3.4.0-SNAPSHOT:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [ 6.535 s]
[INFO] Spark Project Tags ................................. SUCCESS [ 13.719 s]
[INFO] Spark Project Sketch ............................... SUCCESS [ 13.920 s]
[INFO] Spark Project Local DB ............................. SUCCESS [ 21.111 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 32.964 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 12.634 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [ 21.693 s]
[INFO] Spark Project Launcher ............................. SUCCESS [ 14.268 s]
[INFO] Spark Project Core ................................. SUCCESS [03:59 min]
[INFO] Spark Project ML Local Library ..................... SUCCESS [ 39.530 s]
[INFO] Spark Project GraphX ............................... SUCCESS [01:09 min]
[INFO] Spark Project Streaming ............................ SUCCESS [01:51 min]
[INFO] Spark Project Catalyst ............................. SUCCESS [03:47 min]
[INFO] Spark Project SQL .................................. SUCCESS [05:10 min]
[INFO] Spark Project ML Library ........................... SUCCESS [03:12 min]
[INFO] Spark Project Tools ................................ SUCCESS [ 6.566 s]
[INFO] Spark Project Hive ................................. SUCCESS [01:38 min]
[INFO] Spark Project REPL ................................. SUCCESS [ 19.568 s]
[INFO] Spark Project Assembly ............................. SUCCESS [ 2.547 s]
[INFO] Kafka 0.10+ Token Provider for Streaming ........... SUCCESS [ 19.036 s]
[INFO] Spark Integration for Kafka 0.10 ................... SUCCESS [ 37.787 s]
[INFO] Kafka 0.10+ Source for Structured Streaming ........ SUCCESS [ 53.695 s]
[INFO] Spark Project Examples ............................. SUCCESS [ 46.980 s]
[INFO] Spark Integration for Kafka 0.10 Assembly .......... SUCCESS [ 8.720 s]
[INFO] Spark Avro ......................................... SUCCESS [01:02 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 28:05 min
[INFO] Finished at: 2022-03-20T16:00:13+08:00
[INFO] ------------------------------------------------------------------------
- How do you run unit tests?
- You can right-click an individual test method in IDEA and run that specific function.
- You can also run a unit test from the command line:
mvn clean test -Dsuites="org.apache.spark.sql.DataFrameSuite dataframe toString"
Here the class must be given by its fully qualified name (org.apache.spark.sql.DataFrameSuite), and a single test is selected by appending its test name. Spark uses the scalatest-maven-plugin to run its Scala tests, so the exact semantics of mvn clean test -Dsuites and its parameters are documented in the plugin/ScalaTest user guide: www.scalatest.org/user_guide/…
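When scripting this (for example in CI), it can help to assemble the -Dsuites argument from variables so the quoting stays correct. The sketch below just builds and prints the exact command used above; the suite and test name are the example values from the text:

```shell
# Assemble the single-test command shown above. The -Dsuites value takes the
# form "<fully.qualified.SuiteName> <test name>"; omit the test name part
# to run the whole suite.
SUITE="org.apache.spark.sql.DataFrameSuite"
TEST_NAME="dataframe toString"
CMD="mvn clean test -Dsuites=\"$SUITE $TEST_NAME\""
echo "$CMD"
```

To avoid rebuilding and testing every module, the run can also be scoped with Maven's standard -pl flag to the module directory containing the suite.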