Big Data Development: Integrating SparkSQL with Hive (Part 37)

1. Key Concepts

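Integrating SparkSQL with Hive simply means operating on Hive tables directly from SparkSQL. When SparkSQL works with a Hive table, the query no longer runs on the MapReduce engine but on the Spark engine: SparkSQL reads the table metadata from Hive and the data stored on HDFS, then performs the computation with Spark. This lets you use the Spark engine to speed up computation. Because the tables already exist in Hive, you also don't need to create temporary tables in SparkSQL every time, which saves the tedious table-creation step.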

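Common usage patterns:

Integrating Hive in the SparkSQL command line: convenient for debugging, so it is used mainly during the debugging stage.
Integrating Hive in SparkSQL code: operate on Hive tables directly from application code.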
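To make the "no need to create tables in SparkSQL" point concrete, here is a minimal sketch that lists what a Hive-enabled session can already see. It assumes a reachable Hive metastore and uses the `default` database; the object name `ListHiveTables` is only illustrative.

```scala
import org.apache.spark.sql.SparkSession

object ListHiveTables {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local")
      .appName("ListHiveTables") // illustrative name
      .enableHiveSupport()
      .getOrCreate()

    // Databases and tables are read from the Hive metastore,
    // so tables created earlier in Hive show up without any setup here.
    spark.sql("show databases").show()
    spark.catalog.listTables("default").show(false)

    spark.stop()
  }
}
```

Any table that appears in this listing can be queried with `spark.sql(...)` right away.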

1.1 Integrating Hive in Code

Add the dependencies to pom.xml

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.12</artifactId>
    <version>3.1.2</version>
</dependency>

<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>6.0.6</version>
</dependency>

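hive-site.xml configuration (the file has to be on the application classpath, for example under src/main/resources when running from an IDE)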

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
        <description>location of default database for the warehouse</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://xxx:3306/hive?useSSL=false</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>your-username</value>
        <description>Username to use against metastore database</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>your-password</value>
        <description>password to use against metastore database</description>
    </property>
</configuration>
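One detail worth knowing about the configuration above: since Spark 2.0, the `hive.metastore.warehouse.dir` property in hive-site.xml is deprecated in favor of `spark.sql.warehouse.dir`. The sketch below shows one way to set it on the session builder. It reuses the warehouse path from the XML; the object name is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object WarehouseDirExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local")
      .appName("WarehouseDirExample") // illustrative name
      // Default location for managed databases and tables;
      // same path as hive.metastore.warehouse.dir in the XML above.
      .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
      .enableHiveSupport()
      .getOrCreate()

    // Print the effective setting to confirm what the session is using.
    println(spark.conf.get("spark.sql.warehouse.dir"))

    spark.stop()
  }
}
```

The metastore connection properties (JDBC URL, driver, user, password) still come from hive-site.xml in this sketch.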

Scala code

package com.strivelearn.scala.sql

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

/**
 * Operating on Hive from code through SparkSQL
 *
 * @author strivelearn
 * @version SparkSQLReadHive.java, 2022-12-03
 */
object SparkSQLReadHive {
  def main(args: Array[String]): Unit = {
    // Create the SparkConf; "local" runs Spark inside this JVM
    val conf = new SparkConf().setMaster("local")
    val sparkSession = SparkSession
      .builder()
      .appName("SparkSQLReadHive")
      .config(conf)
      // Enable Hive support so the session can connect to the Hive metastore
      .enableHiveSupport()
      .getOrCreate()
    // Query the Hive table student1 and print the rows to the console
    sparkSession.sql("select * from student1").show()
    sparkSession.stop()
  }
}

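Result

Running the program prints the rows of student1 to the console (show() prints at most the first 20 rows by default).

How it works

Spark looks up the table in the Hive metastore database to find where its data is stored on HDFS, then reads the data from that location and processes it with the Spark engine.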
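The storage location that Spark resolves from the metastore can be inspected directly. A minimal sketch, assuming the `student1` table from the example above exists; the object name is illustrative:

```scala
import org.apache.spark.sql.SparkSession

object ShowHiveTableLocation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local")
      .appName("ShowHiveTableLocation") // illustrative name
      .enableHiveSupport()
      .getOrCreate()

    // DESCRIBE FORMATTED returns rows of (col_name, data_type, comment);
    // the "Location" row carries the table's storage path from the metastore.
    spark.sql("describe formatted student1")
      .filter("col_name = 'Location'")
      .show(false)

    spark.stop()
  }
}
```

The printed path is the directory Spark scans when the table is queried.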

2. Ways to Write to Hive Tables from SparkSQL

Using insertInto
Using saveAsTable
Using SparkSQL statements

package com.strivelearn.scala.sql

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

/**
 * @author strivelearn
 * @version SparkSQLWriteHive.java, 2022-12-03
 */
object SparkSQLWriteHive {
  def main(args: Array[String]): Unit = {
    // Create the SparkConf; "local" runs Spark inside this JVM
    val conf = new SparkConf().setMaster("local")
    val sparkSession = SparkSession
      .builder()
      .appName("SparkSQLWriteHive")
      .config(conf)
      // Enable Hive support so the session can connect to the Hive metastore
      .enableHiveSupport()
      .getOrCreate()
    val resDataFrame = sparkSession.sql("select * from student1")
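    // 1. insertInto()
    /**
     * Two conditions must be met:
     * 1. The target Hive table already exists (it can be created in Hive or through SparkSQL; creating it in Hive is the officially recommended approach)
     * 2. The column order of the DataFrame schema matches the column order of the Hive table schema, because insertInto resolves columns by position, not by name
     *
     * In this situation you can also consider writing HDFS files directly: it is enough to write the data files into the HDFS directory that backs the Hive table
     */

    /**
     * The official recommendation is to create the table in Hive in advance and then use it directly from SparkSQL.
     * Note: when creating a Hive table through SparkSQL, storage parameters such as the file format (TextFile by default) only take effect together with `using hive`
     * This has no effect: create table t1(id int) OPTIONS(fileFormat 'parquet')
     * This takes effect:  create table t2(id int) using hive OPTIONS(fileFormat 'parquet')
     * create table t1(id int) is equivalent to create table t1(id int) using hive OPTIONS(fileFormat 'textfile')
     */
    import sparkSession.sql
    // Create the target table as a comma-delimited text table if it is missing
    sql(
      """
        |create table if not exists student2(id int, name string)
        |using hive
        |options(fileFormat 'textfile', fieldDelim ',')
        |""".stripMargin)
    // Overwrite the existing contents of student2 with the query result
    resDataFrame.write.mode("overwrite").insertInto("student2")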
    sparkSession.stop()
  }
}
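Because insertInto matches columns by position and ignores their names, a DataFrame whose columns are in a different order than the table would be written into the wrong columns. One way to guard against this is to reorder the DataFrame to the table's column order first. The sketch below assumes `student1` has columns named `id` and `name` and that `student2` already exists; the helper `alignToTable` and the object name are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object InsertIntoAligned {
  // Hypothetical helper: select the given columns in the given order.
  def alignToTable(df: DataFrame, targetColumns: Seq[String]): DataFrame =
    df.select(targetColumns.map(col): _*)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local")
      .appName("InsertIntoAligned") // illustrative name
      .enableHiveSupport()
      .getOrCreate()

    // Deliberately in the "wrong" order (assumes columns id and name exist).
    val source = spark.sql("select name, id from student1")

    // Read the target table's column order from the metastore.
    val targetColumns = spark.table("student2").columns.toSeq

    // Reorder before the position-based insert.
    alignToTable(source, targetColumns)
      .write
      .mode("overwrite")
      .insertInto("student2")

    spark.stop()
  }
}
```

The helper only fixes ordering; it fails with an analysis error if the DataFrame lacks one of the table's columns, which is usually the desired behavior.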

Creating a table with SparkSQL statements

package com.strivelearn.scala.sql

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

/**
 * @author strivelearn
 * @version SparkSQLWriteHive.java, 2022-12-03
 */
object SparkSQLWriteHive3 {
  def main(args: Array[String]): Unit = {
    // Create the SparkConf; "local" runs Spark inside this JVM
    val conf = new SparkConf().setMaster("local")
    val sparkSession = SparkSession
      .builder()
      .appName("SparkSQLWriteHive")
      .config(conf)
      // Enable Hive support so the session can connect to the Hive metastore
      .enableHiveSupport()
      .getOrCreate()
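    // Case 1: the table does not exist yet, so create and fill it with
    // CREATE TABLE ... AS SELECT (this fails if student2_bak already exists)
    import sparkSession.sql
    sql(
      """
        |CREATE TABLE student2_bak
        |as
        |select * from student1
        |""".stripMargin)

    // Case 2: the table already exists, so append rows with INSERT INTO
    sql(
      """
        |insert into student2_bak
        |select * from student1
        |""".stripMargin)
    sparkSession.stop()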
  }
}
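The second method in the list, saveAsTable, is not covered by the listings above. The following is a minimal sketch rather than the author's code: the target table name `student3` and the object name are hypothetical, and `format("hive")` is used here so that the result is stored as a Hive table instead of in Spark's default data source format.

```scala
import org.apache.spark.sql.SparkSession

object SparkSQLSaveAsTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local")
      .appName("SparkSQLSaveAsTable") // illustrative name
      .enableHiveSupport()
      .getOrCreate()

    val resDataFrame = spark.sql("select * from student1")

    // Unlike insertInto, saveAsTable creates the table when it is missing.
    // With mode("overwrite"), an existing student3 is replaced.
    resDataFrame.write
      .mode("overwrite")
      .format("hive")
      .saveAsTable("student3") // hypothetical table name

    spark.stop()
  }
}
```

When the table already exists and the mode is append, saveAsTable matches columns by name, whereas insertInto matches them by position.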