A hands-on guide to Hive on Spark: hop aboard if you haven't gotten it working yet


Preface

After the previous articles in this series, you should already have a working Hadoop cluster, a Spark cluster, and a Hive installation. Since Hive's default engine is MR, anyone who has tried it has probably wondered why SQL statements crawl along at a snail's pace. Is there a way to speed things up? The answer is: yes! So let's start today's journey!

A few concepts first

  • Hive engines include: MR (the default), Tez, and Spark.
  • Hive on Spark: Hive handles both metadata storage and SQL parsing/optimization, the syntax is HQL, and the execution engine becomes Spark, which runs the job as RDDs.
  • Spark on Hive: Hive only stores the metadata; Spark handles SQL parsing and optimization, the syntax is Spark SQL, and Spark runs the job as RDDs. (A short example contrasting the two follows this list.)

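To make the distinction concrete, here is a minimal sketch of running the same query under each mode. It assumes both Hive on Spark and Spark SQL are already configured against the same metastore, that the hive and spark-sql launchers are on your PATH, and that the student table from the test section later in this article exists:

# Hive on Spark: the Hive CLI parses the HQL, then hands execution to Spark
hive -e "set hive.execution.engine=spark; select count(*) from student;"

# Spark on Hive: Spark SQL parses and executes; Hive only supplies the metastore
spark-sql -e "select count(*) from student;"
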
Compiling the Hive 3.1.2 source

使用hive3.1.2spark3.0.0配置hive on spark的时候,发现官方下载的hive3.1.2spark3.0.0不兼容,hive3.1.2对应的版本是spark2.3.0,而spark3.0.0对应的hadoop版本是hadoop2.6hadoop2.7。 所以,如果想要使用高版本的hivehadoop,我们要重新编译hive,兼容spark3.0.0。除了兼容spark3.0.0外,还将hive3.1.2guava的版本进行了提升,和hadoop3.x保持一致,以便兼容hadoop3.1.3

Detailed compilation steps

1. Download the Hive 3.1.2 source package

Download address: archive.apache.org/dist/hive/h…
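
The link above is truncated; the source tarball normally sits in the hive-3.1.2 directory of the Apache archive. A typical download command, with the full path assumed rather than copied from the original:

# URL assumed from the usual Apache archive layout; verify it against the archive listing first
wget https://archive.apache.org/dist/hive/hive-3.1.2/apache-hive-3.1.2-src.tar.gz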

2. Extract locally and modify the source

2.1 Extract locally

After extracting hive-3.1.2-src.tar.gz, open the Hive source in IDEA. If you don't have IDEA installed yet, I recommend installing it; it really is that good!

2.2 Modify the source

Modify the pom.xml file in the source tree:

<!-- Bump the guava version to match hadoop 3.x -->
<guava.version>27.0-jre</guava.version>
<!-- Update the spark version and the corresponding scala version -->
<spark.version>3.0.0</spark.version>
<scala.binary.version>2.12</scala.binary.version>
<scala.version>2.12.10</scala.version>
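
After editing, an optional sanity check (run from the root of the extracted source tree) confirms the properties now carry the new versions:

# Should print the guava, spark and scala version properties edited above
grep -nE '<(guava\.version|spark\.version|scala\.binary\.version|scala\.version)>' pom.xml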

Modify the following 27 classes in the source:

For the exact changes, refer to github.com/gitlbo/hive… (the edits mainly adapt the code to the newer guava APIs and adjust the column-statistics classes in the standalone metastore).

1. druid-handler/src/java/org/apache/hadoop/hive/druid/serde/DruidScanQueryRecordReader.java
2. llap-server/src/java/org/apache/hadoop/hive/llap/daemon/impl/AMReporter.java
3. llap-server/src/java/org/apache/hadoop/hive/llap/daemon/impl/LlapTaskReporter.java
4. llap-server/src/java/org/apache/hadoop/hive/llap/daemon/impl/TaskExecutorService.java
5. ql/src/test/org/apache/hadoop/hive/ql/exec/tez/SampleTezSessionState.java
6. ql/src/java/org/apache/hadoop/hive/ql/exec/tez/WorkloadManager.java
7. llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskSchedulerService.java
8. llap-common/src/java/org/apache/hadoop/hive/llap/AsyncPbRpcProxy.java
9. ql/src/test/org/apache/hadoop/hive/ql/stats/TestStatsUtils.java
10. spark-client/src/main/java/org/apache/hive/spark/client/metrics/ShuffleWriteMetrics.java
11. spark-client/src/main/java/org/apache/hive/spark/counter/SparkCounter.java
12. New class: standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/columnstats/ColumnsStatsUtils.java
13. standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/columnstats/aggr/DateColumnStatsAggregator.java
14. standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/columnstats/aggr/DecimalColumnStatsAggregator.java
15. standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/columnstats/aggr/DoubleColumnStatsAggregator.java
16. standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/columnstats/aggr/LongColumnStatsAggregator.java
17. standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/columnstats/aggr/StringColumnStatsAggregator.java
18. standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/columnstats/cache/DateColumnStatsDataInspector.java
19. standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/columnstats/cache/DecimalColumnStatsDataInspector.java
20. standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/columnstats/cache/DoubleColumnStatsDataInspector.java
21. standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/columnstats/cache/LongColumnStatsDataInspector.java
22. standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/columnstats/cache/StringColumnStatsDataInspector.java
23. standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/columnstats/merge/DateColumnStatsMerger.java
24. standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/columnstats/merge/DecimalColumnStatsMerger.java
25. standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/columnstats/merge/DoubleColumnStatsMerger.java
26. standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/columnstats/merge/LongColumnStatsMerger.java
27. standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/columnstats/merge/StringColumnStatsMerger.java

3. Compress and upload

Once the source modifications above are done, compress hive-3.1.2-src:

tar -zcf hive-3.1.2-src.tar.gz ./hive-3.1.2-src

Then upload the resulting archive to the /opt/resource directory on one of your virtual machines, for example with scp as shown below.
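
For example (the hostname hadoop1 is simply the one used elsewhere in this article; adjust the user and host to your own setup):

# Upload the source archive to the build VM
scp hive-3.1.2-src.tar.gz root@hadoop1:/opt/resource/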

Note: the VM you upload to needs JDK and Maven installed in advance. That part is straightforward, so install them yourself (a quick web search will get you there).
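
A quick way to confirm both are in place on that VM:

# Both commands should print a version banner rather than "command not found"
java -version
mvn -version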

4. Extract and compile

# Extract
cd /opt/resource
tar -zxvf hive-3.1.2-src.tar.gz
cd hive-3.1.2-src
# Build and package with Maven
mvn clean package -Pdist -DskipTests -Dmaven.javadoc.skip=true

Once the build succeeds, the rebuilt apache-hive-3.1.2-bin.tar.gz is generated under /opt/resource/hive-3.1.2-src/packaging/target/; grab it from there, it is the package we deploy in the next section.
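
If you compiled on the same machine you deploy on, copy the freshly built package into /opt/module so the next section can pick it up (the paths here follow this article's layout; adjust them if yours differs):

# Stage the rebuilt Hive package next to the existing installation
cp /opt/resource/hive-3.1.2-src/packaging/target/apache-hive-3.1.2-bin.tar.gz /opt/module/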

Adapting the existing Hive deployment

1. Copy and modify the configuration files

If you never installed Hive from the official package, you can simply install from the package we just compiled. I had already installed Hive from the official package, so the existing installation needs to be adapted:

cd /opt/module
# Rename the old hive directory
mv hive hive-bak
# Extract the package we recompiled
tar -zxvf apache-hive-3.1.2-bin.tar.gz
# Rename the new hive directory
mv apache-hive-3.1.2-bin hive
# Carry over the configuration files from the previous installation
cp ./hive-bak/conf/hive-site.xml ./hive/conf
cp ./hive-bak/conf/spark-defaults.conf ./hive/conf
# Edit hive-site.xml
vim ./hive/conf/hive-site.xml

Add the following:

 <!-- Location of the Spark jars (note: port 8020 must match the NameNode port) -->
 <property>
     <name>spark.yarn.jars</name>
     <value>hdfs://hadoop1:8020/spark-jars/*.jar</value>
 </property>

 <!-- Hive execution engine -->
 <property>
   <name>hive.execution.engine</name>
   <value>spark</value>
 </property>
 <!-- Hive/Spark connection timeout -->
 <property>
   <name>hive.spark.client.connect.timeout</name>
   <value>10000ms</value>
 </property>

Note: the default value of hive.spark.client.connect.timeout is 1000ms. If Hive throws the following exception when you run an insert statement, increase the parameter to 10000ms: FAILED: SemanticException Failed to get a spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create Spark client for Spark session d9e0224c-3d14-4bf4-95bc-ee3ec56df48e
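
Also note that spark.yarn.jars above points at a /spark-jars directory on HDFS. The original steps do not show populating it, so as an assumed prerequisite the Spark jars need to be uploaded there once, roughly like this (adjust the local Spark path to wherever your Spark jars live):

# Create the HDFS directory referenced by spark.yarn.jars and upload Spark's jars into it
hadoop fs -mkdir -p /spark-jars
hadoop fs -put /opt/module/spark-3.0.0-bin-hadoop3.2/jars/*.jar /spark-jars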

2. Copy the Spark jars into hive/lib

cd /opt/module/spark-3.0.0-bin-hadoop3.2/jars
cp py4j-0.10.9.jar pyrolite-4.30.jar RoaringBitmap-0.7.45.jar scala*.jar \
   snappy-java-1.1.7.5.jar spark-core_2.12-3.0.0.jar spark-kvstore_2.12-3.0.0.jar \
   spark-launcher_2.12-3.0.0.jar spark-network-common_2.12-3.0.0.jar \
   spark-network-shuffle_2.12-3.0.0.jar spark-tags_2.12-3.0.0.jar \
   spark-unsafe_2.12-3.0.0.jar /opt/module/hive/lib/

Note: be sure to copy the Spark jars! Be sure to copy the Spark jars! Be sure to copy the Spark jars! Important things are worth saying three times!!! Without this step, execution still won't succeed.
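
A quick check that the jars actually landed in the new Hive's lib directory:

# Expect to see the spark-*, scala*, py4j, RoaringBitmap and snappy jars copied above
ls /opt/module/hive/lib | grep -E 'spark-|scala|py4j|RoaringBitmap|snappy'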

Testing

1. Start the environment

1. The Spark-on-YARN cluster has been set up successfully.
2. Start the HDFS and YARN services of the Hadoop cluster.
3. Start MySQL, which stores the Hive metastore data (typical startup commands are sketched right after this list).
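
As a rough sketch, the corresponding startup commands usually look like this, run on the appropriate nodes (script paths follow this article's layout and the MySQL service name depends on your distro):

# HDFS and YARN
/opt/module/hadoop/sbin/start-dfs.sh
/opt/module/hadoop/sbin/start-yarn.sh
# MySQL, which backs the Hive metastore
systemctl start mysqld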

cd /opt/module/hive
# Start the Hive CLI
bin/hive

2. Insert-data test

# Create a table:
hive (default)> create table student(id int, name string);
# Insert one row:
hive (default)> insert into table student values(1,'abc');

Execution output:

hive (default)> insert into table student values(1,'abc');
    Query ID = root_20220517214157_a976e115-4cbe-46d1-a26b-27878214e920
    Total jobs = 1
    Launching Job 1 out of 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Running with YARN Application = application_1652771632964_0004
    Kill Command = /opt/module/hadoop/bin/yarn application -kill application_1652771632964_0004
    Hive on Spark Session Web UI URL: http://hadoop3:45607

    Query Hive on Spark job[4] stages: [16, 17]
    Spark job[4] status = RUNNING
    --------------------------------------------------------------------------------------
              STAGES   ATTEMPT        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  
    --------------------------------------------------------------------------------------
    Stage-16 .......         0      FINISHED      1          1        0        0       0  
    Stage-17 .......         0      FINISHED      1          1        0        0       0  
    --------------------------------------------------------------------------------------
    STAGES: 02/02    [==========================>>] 100%  ELAPSED TIME: 5.12 s     
    --------------------------------------------------------------------------------------
    Spark job[4] finished successfully in 5.12 second(s)
    Loading data to table default.student
    OK
    col1  col2
    Time taken: 31.066 seconds
    hive (default)> 

If you see output similar to the above, Hive on Spark has been configured successfully!
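
To get a feel for the speed-up, you can switch the current session back to the default MR engine and run a similar insert for comparison, for example:

# From /opt/module/hive: run one insert on MapReduce and compare the elapsed time with the Spark run above
bin/hive -e "set hive.execution.engine=mr; insert into table student values(2,'def');"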

With that, we have completed the integration of Hive and Spark and greatly sped up HQL execution. Thanks for reading!
