0. Background
Hive supports three execution engines to date: MR, Tez, and Spark. In recent Hive versions MR is no longer recommended; Tez or Spark is advised instead. Since we have already built our real-time data warehouse on ClickHouse, our need for Hive is not particularly urgent, but we still want traditional batch big-data capability, so I spent some time on it recently.
Our big-data stack already includes HDFS, Yarn, HBase, and Flink, plus DolphinScheduler and ClickHouse. Given the inherent learning curve, the ongoing operational burden, and the team's skills, we try to keep the system simple and limit the number of components. So when introducing Hive I started with MR; since new versions no longer recommend it, I then looked at Tez. Frankly, after two or three days I still could not make sense of it, and I could not see much value in Tez for us beyond being an optimized MR (just my personal opinion). That left the third engine, Spark. If Flink were available as a fourth engine I would have chosen it without hesitation, even if the Flink community has its own issues, simply because that is where most of our experience lies. Once I actually looked at Hive on Spark, it became clear we should embrace the Spark ecosystem: Hive and Spark overlap in too many areas. Skipping the details, the conclusion is to adopt the same pattern we use with Flink, namely Spark on Yarn, plus Spark SQL for access to Hive tables.
The Hive-to-Spark version mapping published by the Hive project:
As of November 2024 the latest Hive release is 4.0.1, which targets Hadoop 3.3.6, and the Hive 3.x line is end-of-life. But look at its Spark mapping: it still only supports Spark 2.3.0. Meanwhile the latest Spark release, 3.5.3, targets Hadoop 3 (specifically 3.3.4) and bundles Hive 2.3.9.
How to choose is up to your own requirements. I chose to center on Spark: Spark 3.5.3, Hive 2.3.4, Hadoop 3.3.6.
Hadoop Version | Hive Version | Spark Version
---|---|---
3.3.6 | 2.3.4 | 3.5.3
I. Preparation
a) Hadoop proxyuser configuration
In core-site.xml, add the following settings to resolve the impersonation permission issue; otherwise you will hit the error: User: hadoop is not allowed to impersonate anonymous (state=08S01,code=0).
<property>
<name>hadoop.proxyuser.hadoop.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hadoop.groups</name>
<value>*</value>
</property>
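If the cluster is already running, these proxyuser settings can usually be reloaded without a full restart (a minimal sketch; depending on your deployment you may still prefer restarting the NameNode and ResourceManager):
#refresh proxyuser/superuser mappings on HDFS and Yarn
hdfs dfsadmin -refreshSuperUserGroupsConfiguration
yarn rmadmin -refreshSuperUserGroupsConfiguration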
b) Environment variables
vi /etc/profile
Configure the environment variables; the complete set is as follows:
#java
export JAVA_HOME=/usr/local/jdk1.8.0_201
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$JAVA_HOME/bin:$PATH
#hadoop
export HADOOP_HOME=/mnt/data/di/hadoop-3.3.6
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
#hbase
export HBASE_HOME=/mnt/data/di/hbase-2.5.5
export PATH=$HBASE_HOME/bin:$HBASE_HOME/sbin:$PATH
#flink
export HADOOP_CLASSPATH=`hadoop classpath`
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
#hive
export HIVE_HOME=/data/di/hive-2.3.9
export PATH=$PATH:$HIVE_HOME/bin
#spark
export SPARK_HOME=/data/di/spark-3.5.3
export SPARK_DIST_CLASSPATH=$HADOOP_CLASSPATH
export PATH=$PATH:$SPARK_HOME/bin
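After editing, reload the profile and do a quick sanity check that the tools resolve to the expected versions (the paths above are from my environment; adjust them to yours):
source /etc/profile
echo $HIVE_HOME $SPARK_HOME
hadoop version
spark-submit --version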
II. Install Hive
1. Installation
#Download apache-hive-2.3.4-bin.tar.gz into /data/di.
#Extract
cd /data/di
tar -zxvf apache-hive-2.3.4-bin.tar.gz
#Rename the directory
mv apache-hive-2.3.4-bin hive-2.3.4
2. Configure the metastore (MySQL)
vim $HIVE_HOME/conf/hive-site.xml
<configuration>
<!-- HDFS warehouse directory -->
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
<!-- Hive execution engine -->
<property>
<name>hive.execution.engine</name>
<value>spark</value>
</property>
<!-- jars required by Spark on Yarn -->
<property>
<name>spark.yarn.jars</name>
<value>${fs.defaultFS}/spark-jars/*</value>
</property>
<!-- verify the metastore schema -->
<property>
<name>hive.metastore.schema.verification</name>
<value>true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://di-mysql:3306/di_hive?useSSL=false</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>******</value>
</property>
<!-- host that HiveServer2 binds to -->
<property>
<name>hive.server2.thrift.bind.host</name>
<value>di01</value>
</property>
<!-- port that HiveServer2 listens on -->
<property>
<name>hive.server2.thrift.port</name>
<value>10000</value>
</property>
</configuration>
3. Create the Hive metastore database:
CREATE DATABASE `di_hive` /*!40100 DEFAULT CHARACTER SET utf8 */;
CREATE USER 'hive'@'%' IDENTIFIED by '*******';
GRANT ALL PRIVILEGES ON di_hive.* TO 'hive'@'%' IDENTIFIED by '*******';
GRANT ALL PRIVILEGES ON di_hive.* TO 'hive'@'localhost' IDENTIFIED by '******';
flush privileges;
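Optionally, verify that the new account can log in and see the database (assuming the di-mysql host from the JDBC URL above is reachable from this node):
mysql -h di-mysql -u hive -p -e "SHOW DATABASES LIKE 'di_hive';"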
4. Initialize the schema
We use MySQL, so before running the tool, copy the MySQL driver mysql-connector-java-8.0.29.jar into $HIVE_HOME/lib.
Run the initialization script:
cd $HIVE_HOME
./bin/schematool -dbType mysql -initSchema
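If initialization succeeded, the metastore tables (DBS, TBLS, and so on) should now exist in MySQL; a quick way to confirm, using the credentials configured above:
mysql -h di-mysql -u hive -p di_hive -e "SHOW TABLES;" | head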
5. Start Hive
cd $HIVE_HOME
#start the metastore
nohup ./bin/hive --service metastore>> metastore.log 2>&1 &
#start hiveserver2
nohup ./bin/hiveserver2>> hiveserver2.log 2>&1 &
Check whether the services started successfully with:
netstat -anop| grep 10000
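The metastore listens on port 9083 by default, so it is worth checking both services; jps should also show two RunJar processes (one for the metastore, one for HiveServer2):
netstat -anop | grep -E '9083|10000'
jps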
6. Connect to Hive with beeline
$HIVE_HOME/bin/beeline -u jdbc:hive2://di01:10000 -n hadoop
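The connection can also be smoke-tested non-interactively with a one-off statement (a minimal example):
$HIVE_HOME/bin/beeline -u jdbc:hive2://di01:10000 -n hadoop -e "show databases;"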
III. Spark On Yarn With Hive Configuration
1. Stage the Spark dependency jars and the log directory
Download spark-3.5.3-bin-without-hadoop.tgz into /data/di.
#Extract
cd /data/di
tar -zxvf spark-3.5.3-bin-without-hadoop.tgz
#Rename the directory
mv spark-3.5.3-bin-without-hadoop spark-jars-3.5.3
#Create the /spark-jars directory on HDFS and upload the dependency jars (in hindsight this step is of limited value here, because of the version incompatibility described later)
hdfs dfs -mkdir -p /spark-jars
hdfs dfs -put $SPARK_HOME/../spark-jars-3.5.3/jars/* /spark-jars/
#Create /spark-history on HDFS for event logs
hdfs dfs -mkdir -p /spark-history
#hive-site.xml has already been configured above to set the execution engine to spark and to point spark.yarn.jars at this directory
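Before moving on, a quick check that the jars and directories actually landed on HDFS:
hdfs dfs -ls /spark-jars | head -n 5
hdfs dfs -ls / | grep spark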
2. Configure Spark
Download spark-3.5.3-bin-hadoop3.tgz into /data/di.
#Extract
cd /data/di
tar -zxvf spark-3.5.3-bin-hadoop3.tgz
#Rename the directory
mv spark-3.5.3-bin-hadoop3 spark-3.5.3
#Configure spark-defaults.conf
cd $SPARK_HOME/conf
vim spark-defaults.conf
spark.master yarn
spark.eventLog.enabled true
spark.eventLog.dir hdfs://dins/spark-history
spark.executor.memory 1g
spark.driver.memory 1g
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.executor.cores 2
spark.driver.memoryOverhead 400m
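Two extra notes worth mentioning here. First, for spark-sql (used below) to see the same databases and tables, Spark must be able to reach the Hive metastore; a common approach (an assumption about your layout, not the only option) is to copy or symlink hive-site.xml into Spark's conf directory:
ln -s $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/hive-site.xml
Second, before involving Hive at all, it is worth confirming that Spark itself runs on Yarn with a minimal smoke test (the examples jar name may differ in your build):
spark-submit --class org.apache.spark.examples.SparkPi --master yarn $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.3.jar 10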
IV. Accessing Hive via Spark SQL
1. Statements to run:
create database test;
create table stu02(name string) row format delimited fields terminated by ',';
insert into table stu02(name) values ("Lucy");
2. Via the Hive entry point
The newer Hive 4.0.1 does not even accept spark as an execution engine from the command line; it reports:
Invalid value.. expects one of [mr, tez]. (state=42000,code=1)
Versioning with Hive 4.0.1 really is a headache. When we go through the Hive client, which in particular does not support recent Spark releases, metadata statements work fine, but insert statements fail with the following error:
[hadoop@di01 hive-2.3.9]$ $HIVE_HOME/bin/beeline -u jdbc:hive2://di01:10000 -n hadoop
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/data/di/hive-2.3.9/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/data/di/hadoop-3.3.6/share/hadoop/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://di01:10000
Connected to: Apache Hive (version 2.3.9)
Driver: Hive JDBC (version 2.3.9)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.3.9 by Apache Hive
0: jdbc:hive2://di01:10000> show databaes;
Error: Error while compiling statement: FAILED: ParseException line 1:5 cannot recognize input near 'show' 'databaes' '<EOF>' in ddl statement (state=42000,code=40000)
0: jdbc:hive2://di01:10000> show databases;
+----------------+
| database_name |
+----------------+
| default |
+----------------+
1 row selected (0.553 seconds)
0: jdbc:hive2://di01:10000> create databases data_insight;
Error: Error while compiling statement: FAILED: ParseException line 1:7 cannot recognize input near 'create' 'databases' 'data_insight' in ddl statement (state=42000,code=40000)
0: jdbc:hive2://di01:10000> create database data_insight;
No rows affected (0.131 seconds)
0: jdbc:hive2://di01:10000> show databases;
+----------------+
| database_name |
+----------------+
| data_insight |
| default |
+----------------+
2 rows selected (0.074 seconds)
0: jdbc:hive2://di01:10000> set hive.execution.engine
. . . . . . . . . . . . . > ;
+------------------------------+
| set |
+------------------------------+
| hive.execution.engine=spark |
+------------------------------+
1 row selected (0.01 seconds)
0: jdbc:hive2://di01:10000> set hive.execution.engine=spark;
No rows affected (0.004 seconds)
0: jdbc:hive2://di01:10000> use data_insight;
No rows affected (0.073 seconds)
0: jdbc:hive2://di01:10000> create table stu01(name string) partitioned by(age int) row format delimited fields terminated by ',';
No rows affected (0.373 seconds)
0: jdbc:hive2://di01:10000> show tables;
+-----------+
| tab_name |
+-----------+
| stu01 |
+-----------+
1 row selected (0.084 seconds)
0: jdbc:hive2://di01:10000> select 1 from stu01;
+------+
| _c0 |
+------+
+------+
No rows selected (1.139 seconds)
0: jdbc:hive2://di01:10000> insert into table stu01(name,age) values ("Lucy",22);
Error: Error while compiling statement: FAILED: SemanticException 1:24 'age' in insert schema specification is not found among regular columns of data_insight.stu01 nor dynamic partition columns.. Error encountered near token 'age' (state=42000,code=40000)
0: jdbc:hive2://di01:10000> create table stu02(name string) row format delimited fields terminated by ',';
No rows affected (0.271 seconds)
0: jdbc:hive2://di01:10000> insert into table stu02(name) values ("Lucy");
Error: org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. java.lang.NoClassDefFoundError: org/apache/spark/AccumulatorParam
at org.apache.hive.spark.counter.SparkCounterGroup.createCounter(SparkCounterGroup.java:52)
at org.apache.hive.spark.counter.SparkCounters.createCounter(SparkCounters.java:71)
at org.apache.hive.spark.counter.SparkCounters.createCounter(SparkCounters.java:67)
at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:337)
at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:358)
at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.AccumulatorParam
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 10 more
None of this works, so there is nothing for it but to give up on the Hive entry point. Trying Spark SQL instead, everything works fine.
[hadoop@di01 bin]$ ./spark-sql
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/11/17 14:18:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/11/17 14:18:58 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
24/11/17 14:19:00 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
24/11/17 14:19:17 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
24/11/17 14:19:17 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
24/11/17 14:19:17 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
Spark Web UI available at http://di01:4040
Spark master: yarn, Application Id: application_1731211686337_0008
spark-sql (default)> show databases;
default
Time taken: 1.335 seconds, Fetched 1 row(s)
spark-sql (default)> show databases;
data_insight
default
Time taken: 0.038 seconds, Fetched 2 row(s)
spark-sql (default)>
>
> show tables;
24/11/17 14:48:21 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Time taken: 0.077 seconds
spark-sql (default)> use data_insight;
Time taken: 0.169 seconds
spark-sql (data_insight)> show tables;
stu01
stu02
Time taken: 0.1 seconds, Fetched 2 row(s)
spark-sql (data_insight)> select 1 from stu02;
24/11/17 14:48:54 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
Time taken: 1.14 seconds
spark-sql (data_insight)> insert into table stu02(name) values ("Lucy");
24/11/17 14:49:58 WARN HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
24/11/17 14:49:58 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
24/11/17 14:49:58 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
Time taken: 2.996 seconds
spark-sql (data_insight)> select * from stu02;
Lucy
Time taken: 0.325 seconds, Fetched 1 row(s)
spark-sql (data_insight)>
This also made it clear what we should do: since the Spark integration in the Hive community is too old and moves too slowly, we can simply use Spark as the unified entry point, connecting to Hive through Spark SQL to create tables, query them, and load data. The overall call flow is as follows: