[Stumbling into Big Data] [From Installation to Getting Started] Spark Deployment

Author: 元公子 · Date: 2020-01-27 (Monday) · Weather: Dongguan, windy after the rain

The less you know, the less you don't know.

Without likes from friends, there's no leveling up to fight the next boss.

1. Environment Preparation

  • The examples use CentOS 7 (64-bit)
  • Java 1.8 or later
  • Hadoop already installed
  • Python already installed
  • Scala already installed
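
A quick way to verify these prerequisites before starting (run as whichever user will operate Spark):

[hadoop@hadoop-master /home/hadoop]$ java -version        # expect 1.8 or later
[hadoop@hadoop-master /home/hadoop]$ hadoop version       # Hadoop already on PATH
[hadoop@hadoop-master /home/hadoop]$ python --version
[hadoop@hadoop-master /home/hadoop]$ scala -version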

2. Download the Package

Official download page: https://spark.apache.org/downloads.html

Download the latest release: spark-2.4.4-bin-without-hadoop.tgz

(Screenshot: the spark.apache.org download page, 2020-01-27)
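
If you prefer to fetch it from the command line, the tarball is also available from the Apache archive (URL assumed from the standard archive layout; use a closer mirror if you have one):

[root@hadoop-master /soft]# wget https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-without-hadoop.tgz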

3. Installation

Extract the package and set up the install directory:

[root@hadoop-master /soft]# tar -xvzf spark-2.4.4-bin-without-hadoop.tgz
[root@hadoop-master /soft]# chown -R hadoop:hadoop spark-2.4.4-bin-without-hadoop
[root@hadoop-master /soft]# ln -s spark-2.4.4-bin-without-hadoop spark

Set the environment variables. PYSPARK_DRIVER_PYTHON selects the Python environment used by the PySpark driver; see the common-environment post for details.

[root@hadoop-master /soft]# vi /etc/profile
export SPARK_HOME=/soft/spark
export SPARK_CONF_DIR=/home/hadoop/spark/conf
export PATH=$PATH:$SPARK_HOME/bin
export PYSPARK_DRIVER_PYTHON=$ANACONDA_HOME/bin/ipython
export PYSPARK_PYTHON=$ANACONDA_HOME/bin/python
[root@hadoop-master /soft]# source /etc/profile
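
A quick sanity check that the new variables are visible in the current shell (a sketch, assuming the profile above):

[root@hadoop-master /soft]# echo $SPARK_HOME
/soft/spark
[root@hadoop-master /soft]# which spark-shell
/soft/spark/bin/spark-shell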

Create the configuration (and other working) directories:

[root@hadoop-master /soft]# su - hadoop
[hadoop@hadoop-master /home/hadoop]$ mkdir -p /home/hadoop/spark/conf
[hadoop@hadoop-master /home/hadoop]$ cp -fr /soft/spark/conf/* /home/hadoop/spark/conf/

Edit the configuration files:

[hadoop@hadoop-master /home/hadoop]$ cp /home/hadoop/spark/conf/spark-env.sh.template /home/hadoop/spark/conf/spark-env.sh
[hadoop@hadoop-master /home/hadoop]$ cp /home/hadoop/spark/conf/slaves.template /home/hadoop/spark/conf/slaves
[hadoop@hadoop-master /home/hadoop]$ vi /home/hadoop/spark/conf/spark-env.sh
export JAVA_HOME=/soft/jdk
export SCALA_HOME=/soft/scala
export SPARK_HOME=/soft/spark
export SPARK_CONF_DIR=/home/hadoop/spark/conf
export SPARK_LOG_DIR=/home/hadoop/spark/log
export SPARK_MASTER_IP=hadoop-master
export SPARK_WORKER_MEMORY=512m
export HADOOP_CONF_DIR=/soft/hadoop/etc/hadoop
export SPARK_DIST_CLASSPATH=$(${HADOOP_HOME}/bin/hadoop classpath)

[hadoop@hadoop-master /home/hadoop]$ vi /home/hadoop/spark/conf/slaves
hadoop-dn1
hadoop-dn2
hadoop-dn3
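
Because this is the "without-hadoop" build, Spark finds the Hadoop jars through SPARK_DIST_CLASSPATH. Before going further, it is worth checking that the Hadoop classpath actually resolves (a sketch, assuming HADOOP_HOME is already exported, e.g. /soft/hadoop):

[hadoop@hadoop-master /home/hadoop]$ ${HADOOP_HOME}/bin/hadoop classpath

If this prints nothing, fix HADOOP_HOME before starting Spark.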

Before syncing the installation to the worker nodes, first handle the possible exceptions described in Section 5.

# as the hadoop user
[hadoop@hadoop-master /soft]$ xrsync.sh /soft/spark
================ dn1 ==================
================ dn2 ==================
================ dn3 ==================
[hadoop@hadoop-master /soft]$ xrsync.sh /soft/spark-2.4.4-bin-without-hadoop
[hadoop@hadoop-master /soft]$ xrsync.sh /home/hadoop/spark
# as the root user
[hadoop@hadoop-master /soft]$ su - root
[root@hadoop-master /root]# xrsync.sh /etc/profile
[root@hadoop-master /root]# xcall.sh source /etc/profile

Ready to launch:

[hadoop@hadoop-master /home/hadoop]$ run-example SparkPi 10
[hadoop@hadoop-master /home/hadoop]$ spark-shell --master local[2]
scala> :quit
[hadoop@hadoop-master /home/hadoop]$ pyspark --master local[2]
Using Python version 3.6.5 (default, Apr 29 2018 16:14:56)
SparkSession available as 'spark'.
In [1]: exit
[hadoop@hadoop-master /home/hadoop]$ /soft/spark/sbin/start-all.sh
org.apache.spark.deploy.master.Master running as process 36750.  Stop it first.
hadoop-dn3: starting org.apache.spark.deploy.worker.Worker, logging to /home/hadoop/spark/log/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-hadoop-dn3.out
hadoop-dn2: starting org.apache.spark.deploy.worker.Worker, logging to /home/hadoop/spark/log/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-hadoop-dn2.out
hadoop-dn1: starting org.apache.spark.deploy.worker.Worker, logging to /home/hadoop/spark/log/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-hadoop-dn1.out
[hadoop@hadoop-master /home/hadoop]$ jps
36750 Master
[hadoop@hadoop-dn1 /home/hadoop]$ jps
5653 Worker

# Supplement: start the daemons individually
# /soft/spark/sbin/start-master.sh     # start the master
# /soft/spark/sbin/start-slaves.sh     # start the workers listed in conf/slaves

Spark Web UI

http://hadoop-master:8080
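
With the master and workers up, you can submit a job to the standalone cluster. A minimal sketch using the bundled SparkPi example (the examples jar name assumes the stock 2.4.4 layout; adjust if yours differs):

[hadoop@hadoop-master /home/hadoop]$ spark-submit \
    --master spark://hadoop-master:7077 \
    --class org.apache.spark.examples.SparkPi \
    /soft/spark/examples/jars/spark-examples_2.11-2.4.4.jar 100

The finished run should then show up under "Completed Applications" in the Web UI.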

4. Auto-Start the Services

Master node

[hadoop@hadoop-master /home/hadoop]$ su - root
[root@hadoop-master /root]# vi /etc/systemd/system/spark-master.service
[Unit]
Description=spark-master
After=syslog.target network.target

[Service]
Type=forking
User=hadoop
Group=hadoop

ExecStart=/soft/spark/sbin/start-master.sh
ExecStop=/soft/spark/sbin/stop-master.sh

[Install]
WantedBy=multi-user.target
Save and exit: Esc, then :wq
[root@hadoop-master /root]# chmod 755 /etc/systemd/system/spark-master.service
[root@hadoop-master /root]# systemctl enable spark-master
[root@hadoop-master /root]# service spark-master start
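
A quick check that the unit behaves as expected (standard systemd commands):

[root@hadoop-master /root]# systemctl status spark-master
[root@hadoop-master /root]# journalctl -u spark-master --no-pager | tail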

Worker (slave) nodes

[hadoop@hadoop-dn1 /home/hadoop]$ su - root
[root@hadoop-dn1 /root]# vi /etc/systemd/system/spark-slave.service
[Unit]
Description=spark-slave
After=syslog.target network.target

[Service]
Type=forking
User=hadoop
Group=hadoop

ExecStart=/soft/spark/sbin/start-slave.sh spark://hadoop-master:7077
ExecStop=/soft/spark/sbin/stop-slave.sh

[Install]
WantedBy=multi-user.target
Save and exit: Esc, then :wq
[root@hadoop-dn1 /root]# chmod 755 /etc/systemd/system/spark-slave.service
[root@hadoop-dn1 /root]# systemctl enable spark-slave
[root@hadoop-dn1 /root]# service spark-slave start
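
This unit has to exist on every worker node. One way to push and enable it everywhere, reusing the xrsync.sh / xcall.sh helpers from earlier (a sketch; it assumes those scripts can run as root and reach all workers):

[root@hadoop-master /root]# xrsync.sh /etc/systemd/system/spark-slave.service
[root@hadoop-master /root]# xcall.sh systemctl daemon-reload
[root@hadoop-master /root]# xcall.sh systemctl enable spark-slave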

Notebook

[hadoop@hadoop-master /home/hadoop]$ su - root
[root@hadoop-master /root]# vi /etc/init.d/notebook
#!/bin/sh
# chkconfig: 345 85 15
# description: service for notebook
# processname: notebook

case "$1" in
        start)
                echo "Starting hive"
                su - hadoop -c 'export PYSPARK_DRIVER_PYTHON_OPTS="notebook --config=/home/hadoop/.ipython/profile_myserver/ipython_notebook_config.py"; nohup pyspark >/dev/null 2>&1 &'
                echo "ipython_notebook started"
                ;;
        stop)
                echo "Stopping ipython_notebook"
                PID_COUNT=`ps aux |grep ipython_notebook |grep -v grep | wc -l`
                PID=`ps aux |grep ipython_notebook |grep -v grep | awk '{print $2}'`
                if [ $PID_COUNT -gt 0 ];then
                    echo "Try stop ipython_notebook"
                    kill -9 $PID
                    echo "Kill ipython_notebook SUCCESS!"
                else
                    echo "There is no ipython_notebook!"
                fi
                ;;
        restart)
                echo "Restarting ipython_notebook"
                $0 stop
                $0 start
                ;;
        status)
                PID_COUNT=`ps aux |grep ipython_notebook |grep -v grep | wc -l`
                if [ $PID_COUNT -gt 0 ];then
                    echo "ipython_notebook is running"
                else
                    echo "ipython_notebook is stopped"
                fi
                ;;
        *)
                echo "Usage:$0 {start|stop|restart|status}"
                exit 1
esac
Save and exit: Esc, then :wq
[root@hadoop-master /root]# chmod 755 /etc/init.d/notebook
[root@hadoop-master /root]# chkconfig --add notebook
[root@hadoop-master /root]# chkconfig notebook on
[root@hadoop-master /root]# service notebook start
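
To verify the notebook service came up and is listening (port 8888 per the config shown in Section 6; adjust if you changed it):

[root@hadoop-master /root]# service notebook status
[root@hadoop-master /root]# ss -tlnp | grep 8888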

5. Pitfalls Encountered

  • Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger

Copy log4j-1.2.17.jar, slf4j-api-1.7.30.jar and slf4j-log4j12-1.7.25.jar into /soft/spark/jars; any reasonably recent versions of these three jars will do.

[hadoop@hadoop-master /home/hadoop]$ ll /soft/spark/jars/log*
log4j-1.2.17.jar logging-interceptor-3.12.0.jar 
[hadoop@hadoop-master /home/hadoop]$ ll /soft/spark/jars/slf4j-*
slf4j-api-1.7.30.jar  slf4j-log4j12-1.7.25.jar
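
A convenient source for these jars is the Hadoop installation itself (the path below assumes a stock Hadoop layout under /soft/hadoop; check the exact file names and versions on your system):

[hadoop@hadoop-master /home/hadoop]$ cp /soft/hadoop/share/hadoop/common/lib/slf4j-*.jar /soft/spark/jars/
[hadoop@hadoop-master /home/hadoop]$ cp /soft/hadoop/share/hadoop/common/lib/log4j-*.jar /soft/spark/jars/
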
  • JAVA_HOME is not set
[hadoop@hadoop-master /home/hadoop]$ vi /soft/spark/sbin/spark-config.sh
# Append the following below the line: export PYSPARK_PYTHONPATH_SET=1
  export JAVA_HOME=/soft/jdk
  export SPARK_HOME=/soft/spark
  export HADOOP_HOME=/soft/hadoop
  export HADOOP_CONF_DIR=/soft/hadoop/etc/hadoop
  export SPARK_CONF_DIR=/home/hadoop/spark/conf
  export SPARK_LOG_DIR=/home/hadoop/spark/log
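
If this change was made after the earlier sync, remember to push the updated spark-config.sh out to the workers again (a sketch, reusing the xrsync.sh helper):

[hadoop@hadoop-master /home/hadoop]$ xrsync.sh /soft/spark/sbin/spark-config.sh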

6. Additional Notes

Using IPython Notebook remotely

[root@hadoop-master /root]# pip install ipython
[root@hadoop-master /root]# su - hadoop
[hadoop@hadoop-master /home/hadoop]$ ipython
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
Type 'copyright', 'credits' or 'license' for more information
IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from IPython.lib import passwd

In [2]: passwd()
Enter password: 
Verify password: 
Out[2]: 'sha1:9435b2964949:cdcf603ca1cf095c5141270b66e9848db30d09f9'
# password: 123456

[hadoop@hadoop-master /home/hadoop]$ ipython profile create myserver
[hadoop@hadoop-master /home/hadoop]$ vi /home/hadoop/.ipython/profile_myserver/ipython_notebook_config.py

c = get_config()
c.IPKernelApp.pylab='inline'
c.NotebookApp.ip='*'
c.NotebookApp.open_browser=False
c.NotebookApp.password=u'sha1:9435b2964949:cdcf603ca1cf095c5141270b66e9848db30d09f9'
c.NotebookApp.port=8888

[hadoop@hadoop-master /home/hadoop]$ PYSPARK_DRIVER_PYTHON_OPTS="notebook --config=/home/hadoop/.ipython/profile_myserver/ipython_notebook_config.py" pyspark
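
The notebook should then be reachable from your workstation at http://hadoop-master:8888 (log in with the password set above). A quick check that it is responding:

[hadoop@hadoop-master /home/hadoop]$ curl -sI http://hadoop-master:8888 | head -n 1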

Appendix: