Version info:
spark: 3.0.0
scala: 2.12
cdh: 6.2.0
hive: 2.1.1
Connecting from Spark in local IDEA to the company's CDH cluster. The hive-site.xml file has already been placed under resources, and the dependencies are configured:
<!-- https://mvnrepository.com/artifact/mysql/mysql-connector-java -->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.21</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.12</artifactId>
<version>3.0.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hive/hive-exec -->
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>2.1.1</version>
</dependency>
Test code:
val spark: SparkSession = SparkSession
.builder()
.enableHiveSupport()
.master("local[*]")
.appName("hive")
.config("hive.metastore.uris", "thrift://bd1.bcht:9083,thrift://bd2.bcht:9083")
.getOrCreate()
spark.sql("show databases ").show()
Running it kept failing with the "Hive support because Hive classes are not found" error. Online answers generally blame the dependencies, but after staring at mine for a long time they looked fine, so I suspected hive-site.xml instead. The original file was copied straight from CDH; I rewrote it as a plain vanilla hive-site.xml and just filled in the values from the CDH XML:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- Hive metastore service URI -->
<property>
<name>hive.metastore.uris</name>
<value>thrift://[metastore-host-ip]:9083</value>
</property>
<property>
<name>hive.server2.thrift.port</name>
<value>10000</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://[mysql-host-ip-used-by-hive]:3306/hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>password</value>
</property>
<property>
<name>hive.zookeeper.quorum</name>
<value>cdh-06.prod.ycsInsight.yonyou.com,cdh-02.prod.ycsInsight.yonyou.com,cdh-08.prod.ycsInsight.yonyou.com</value>
</property>
<!-- Hive warehouse path on HDFS -->
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
<!-- Cluster HDFS access URL -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://[namenode-ip]:8020</value>
</property>
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
<property>
<name>datanucleus.autoCreateSchema</name>
<value>true</value>
</property>
<property>
<name>datanucleus.autoStartMechanism</name>
<value>checked</value>
</property>
</configuration>
After the rewrite I ran it again and got the same old error. So, following the usual advice online, I went to read the source of .enableHiveSupport():
def enableHiveSupport(): Builder = synchronized {
if (hiveClassesArePresent) {
config(CATALOG_IMPLEMENTATION.key, "hive")
} else {
throw new IllegalArgumentException(
"Unable to instantiate SparkSession with Hive support because " +
"Hive classes are not found.")
}
}
So hiveClassesArePresent is already false. Reading on, that check tries to load two classes by name, and apparently at least one of them could not be found:
private[spark] def hiveClassesArePresent: Boolean = {
try {
Utils.classForName(HIVE_SESSION_STATE_BUILDER_CLASS_NAME)
Utils.classForName("org.apache.hadoop.hive.conf.HiveConf")
true
} catch {
case _: ClassNotFoundException | _: NoClassDefFoundError => false
}
}
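Before firing up the debugger, a quick standalone way to see which of the two classes is missing is to look them up on the classpath directly. A minimal sketch, assuming the builder class name below is what the Spark 3.0.0 constant HIVE_SESSION_STATE_BUILDER_CLASS_NAME resolves to:
object CheckHiveClasses {
  def main(args: Array[String]): Unit = {
    // The two classes enableHiveSupport() looks for; the first name is an assumption
    // about what HIVE_SESSION_STATE_BUILDER_CLASS_NAME points to in Spark 3.0.0.
    val names = Seq(
      "org.apache.spark.sql.hive.HiveSessionStateBuilder",
      "org.apache.hadoop.hive.conf.HiveConf"
    )
    val cl = Thread.currentThread().getContextClassLoader
    names.foreach { name =>
      // getResource returns the jar URL when the class file is on the classpath, null otherwise.
      val url = cl.getResource(name.replace('.', '/') + ".class")
      println(s"$name -> ${Option(url).getOrElse("NOT FOUND")}")
    }
  }
}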
Which one is actually missing, HIVE_SESSION_STATE_BUILDER_CLASS_NAME or org.apache.hadoop.hive.conf.HiveConf? I set a breakpoint and debugged. Following the idea in the post "win10 IDEA环境下SparkSQL连接Hive的几个坑" (pitfalls of connecting Spark SQL to Hive from IDEA on Win10), I added the Hadoop dependency:
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>3.2.0</version>
</dependency>
I didn't expect much from this step, because I assumed the Spark jars already contained that package, and I was ready to move on. To my surprise it connected, and this particular problem never came back.
All in all, "Hive support because Hive classes are not found" really is a dependency problem; but if the Hive dependencies look fine, take a look at the Hadoop dependencies as well.
Yesterday show databases worked and I thought the problem was solved, but today, as soon as I queried actual data, it failed again:
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table acd_xxx. Invalid method name: 'get_table_req'
The common explanation online is a version mismatch: Spark 3.0.0 bundles Hive 2.3.7 while the company cluster runs Hive 2.1.1.
There are mainly two fixes suggested online:
1. Delete all the Hive-related jars under spark/jars and replace them with the Hive jars from the CDH cluster's /opt/cloudera/parcels/CDH-6.2.0-.cdh6.2.0.p0.967373/lib/hive/lib/.
2. Point Spark at the cluster's Hive jars through the spark.sql.hive.metastore.version and spark.sql.hive.metastore.jars settings.
Both approaches point at the same root cause: the jar dependencies.
The first approach, replacing the Hive jars under the local Windows spark/jars with the cluster's hive/lib hive-* jars (i.e. 2.1.1), had no effect for me; same error.
The second approach is to set the two metastore properties, for example via spark-defaults.conf; a rough sketch follows.
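Assuming the usual key/value layout of spark-defaults.conf, the two lines would look roughly like this (the jar path is the cluster path quoted above, which only exists on the Linux nodes):
spark.sql.hive.metastore.version   2.1
spark.sql.hive.metastore.jars      /opt/cloudera/parcels/CDH-6.2.0-.cdh6.2.0.p0.967373/lib/hive/lib/*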
I copied spark-defaults.conf.template in the local Windows Spark conf directory, renamed it spark-defaults.conf, and added those two lines, but it didn't help. The bigger issue is that the "/opt/..." value is a Linux cluster path that cannot be used directly from Windows. Setting up a network path seemed like a hassle, so I copied the cluster's entire Hive lib directory over to local E:/大数据/spark/lib/ and set the options in code instead:
val conf = new SparkConf()
  .set("spark.sql.hive.metastore.version", "2.1") // "2.1.1" also works
  .set("spark.sql.hive.metastore.jars", "E:/大数据/spark/lib/*")
  .setMaster("local[*]")
  .setAppName("hive")
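That SparkConf still has to be handed to the SparkSession builder. A minimal sketch of the wiring, continuing directly from the conf above and reusing the metastore URI from the earlier test code (same imports as before):
val spark: SparkSession = SparkSession
  .builder()
  .config(conf) // picks up the spark.sql.hive.metastore.version/jars settings
  .enableHiveSupport()
  .config("hive.metastore.uris", "thrift://bd1.bcht:9083,thrift://bd2.bcht:9083")
  .getOrCreate()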
Query:
spark.sql("select * from dwd.user_visit_action").show(10) ;
After more than a day of messing around, the data finally came out. Damn!!