Hudi Study Notes
Preface
In the previous post I took Hudi for a quick spin with spark-shell, but the Spark version bundled with CDH caused a lot of exceptions, and in real work IDEA is the more common choice anyway. Environment: local Spark 3.0.0; cluster: CDH 6.2.0 with Spark 2.4.0, Hive 2.1, Hudi 0.9.0, Scala 2.12.
1. Preparing the IDEA-to-CDH environment
The preparation comes down to a few steps:
1. Copy the dependency jar hudi-hadoop-mr-bundle-0.9.0.jar into hive/lib on the cluster; this is what lets Hive read Hudi data correctly. (spark-sql access needs this same jar as well; more on that later.)
cp hudi-hadoop-mr-bundle-0.9.0.jar /export/server/hive/lib/
2. Copy the three cluster configuration files into the project's resources directory.
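To confirm the XML files are actually being picked up from resources, a minimal sanity-check sketch like the one below can be run from IDEA. It only assumes the standard Hadoop Configuration API and the usual property names fs.defaultFS and hive.metastore.uris; if it prints the cluster addresses instead of defaults, the files are on the classpath.
import org.apache.hadoop.conf.Configuration

object CheckClusterConf {
  def main(args: Array[String]): Unit = {
    // new Configuration() loads core-site.xml from the classpath (our resources dir)
    val conf = new Configuration()
    // hive-site.xml is not loaded automatically, so add it explicitly from the classpath
    conf.addResource("hive-site.xml")

    println("fs.defaultFS        = " + conf.get("fs.defaultFS"))
    println("hive.metastore.uris = " + conf.get("hive.metastore.uris"))
  }
}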
3. Add the dependencies to pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>SparkDemo</artifactId>
        <groupId>org.example</groupId>
        <version>1.0-SNAPSHOT</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>
    <artifactId>spark-code</artifactId>
    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>3.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>3.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.12</artifactId>
            <version>3.0.0</version>
        </dependency>
        <!-- Hudi Spark 3 bundle and spark-avro, required by the Hudi code below -->
        <dependency>
            <groupId>org.apache.hudi</groupId>
            <artifactId>hudi-spark3-bundle_2.12</artifactId>
            <version>0.9.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-avro_2.12</artifactId>
            <version>3.0.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/mysql/mysql-connector-java -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.21</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.12</artifactId>
            <version>3.0.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-exec -->
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>2.1.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>3.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
            <version>3.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka_2.11</artifactId>
            <version>2.1.0</version>
        </dependency>
    </dependencies>
    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>
</project>
2. Writing data to HDFS on the CDH cluster
The code below is long; it covers inserting data, querying data, querying by a point in time, incremental queries, updating data, and so on. To test one operation, just comment out the other function calls in main. I won't split it up and explain each piece separately; if it throws exceptions after you copy it, debug as needed.
insertData does the insert. Once it finishes, wait a moment and check HDFS to see whether the data has arrived. If the program runs fine but no data shows up, check whether the pom is missing the Hadoop dependency and whether the XML configuration files are correct.
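If you prefer checking from code rather than the HDFS web UI, a minimal sketch along these lines lists the table path with the HDFS FileSystem API. It reuses the tablePath from the program below and assumes the cluster XML files are on the classpath.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ListHudiFiles {
  def main(args: Array[String]): Unit = {
    // Same path as tablePath in the Hudi program below
    val tablePath = "/hudidatas/hudi-warehouse/tb1_trips_cow"

    // Picks up fs.defaultFS from core-site.xml in resources
    val fs = FileSystem.get(new Configuration())

    // Recursively list every file under the table path
    val it = fs.listFiles(new Path(tablePath), true)
    while (it.hasNext) {
      val status = it.next()
      println(s"${status.getLen}\t${status.getPath}")
    }
    fs.close()
  }
}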
Below are the HDFS data files that were generated; the full code follows:
package hudidemo

import org.apache.hudi.DataSourceReadOptions
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import scala.collection.JavaConverters._

object testHudi {
  def main(args: Array[String]): Unit = {
    // Create the SparkSQL environment; Hudi requires Kryo serialization
    val conf = new SparkConf().setMaster("local[*]").setAppName("insertDatasToHudi")
    val spark = SparkSession.builder().config(conf)
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("hive.metastore.uris", "thrift://bd1.bcht:9083,thrift://bd2.bcht:9083")
      .getOrCreate()

    // Table name and storage path
    val tableName: String = "tb1_trips_cow"
    val tablePath: String = "/hudidatas/hudi-warehouse/tb1_trips_cow"

    // Data generator
    val generator = new DataGenerator()

    // Insert data
    // insertData(spark, tableName, tablePath, "append", generator)
    // Query data
    queryData(spark, tablePath)
    // Query by time
    // queryDataByTime(spark, tablePath)
    // Update data
    // updateData(spark, generator, tableName, tablePath)
    // Incremental query
    // incrementQueryData(spark, tablePath)

    // Shut down
    spark.stop()
  }
  /**
   * Insert data
   * @param spark
   * @param tableName
   * @param tablePath
   * @param savemode
   * @param dataGen
   */
  def insertData(spark: SparkSession, tableName: String, tablePath: String, savemode: String, dataGen: DataGenerator): Unit = {
    // Import implicit conversions
    import spark.implicits._

    // Step 1: generate mock trip data
    // val generator: DataGenerator = new DataGenerator()
    val insertData = convertToStringList(dataGen.generateInserts(100))
    val dataDF = spark.read.json(spark.sparkContext.parallelize(insertData, 2).toDS())

    // Save the data
    dataDF.write
      .format("hudi")
      .mode(savemode)
      .option("hoodie.insert.shuffle.parallelism", "2")
      .option("hoodie.upsert.shuffle.parallelism", "2")
      // Hudi table properties
      .option(PRECOMBINE_FIELD.key(), "ts")
      .option(RECORDKEY_FIELD.key(), "uuid")
      .option(PARTITIONPATH_FIELD.key(), "partitionpath")
      .option(TBL_NAME.key(), tableName)
      .save(tablePath)
  }
  /**
   * Query data
   * @param spark
   * @param tablePath
   */
  def queryData(spark: SparkSession, tablePath: String): Unit = {
    import spark.implicits._

    // Snapshot read, same as in spark-shell
    val tripsSnapshotDF = spark.read
      .format("hudi")
      .load(tablePath)
    tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
    tripsSnapshotDF.printSchema()

    // Query through Spark SQL
    spark.sql("select * from hudi_trips_snapshot where fare > 20.0").show()

    // Query trips with fare between 20 and 50 through the DataFrame API
    // tripsSnapshotDF
    //   .filter($"fare" >= 20 && $"fare" <= 50)
    //   .select($"driver", $"rider", $"fare", $"begin_lat", $"begin_lon", $"partitionpath", $"_hoodie_commit_time")
    //   .orderBy($"fare".desc, $"_hoodie_commit_time".desc)
    //   .show(20, truncate = false)
  }
  /**
   * Query as of a point in time
   * @param spark
   * @param tablepath
   */
  def queryDataByTime(spark: SparkSession, tablepath: String): Unit = {
    import org.apache.spark.sql.functions._

    // Option 1: timestamp string in yyyyMMddHHmmss format
    val df1 = spark.read
      .format("hudi")
      .option("as.of.instant", "20220624102053")
      .load(tablepath)
      .sort(col("_hoodie_commit_time").desc)
    df1.show(numRows = 5, truncate = false)

    // Option 2: timestamp string in yyyy-MM-dd HH:mm:ss format
    val df2 = spark.read
      .format("hudi")
      .option("as.of.instant", "2022-06-24 16:14:39")
      .load(tablepath)
      .sort(col("_hoodie_commit_time").desc)
    df2.show(numRows = 5, truncate = false)

    // Option 3: date string in yyyy-MM-dd format
    val df3 = spark.read
      .format("hudi")
      .option("as.of.instant", "2022-06-24")
      .load(tablepath)
    df3.show(numRows = 5, truncate = false)
  }
  /*
   Update Hudi data. The keys being updated must already exist,
   so the same DataGenerator instance used for the inserts must be reused here.
   */
  def updateData(spark: SparkSession, dataGen: DataGenerator, tableName: String, tablePath: String): Unit = {
    import org.apache.hudi.QuickstartUtils._
    import spark.implicits._
    import scala.collection.JavaConverters._

    // Note: generateUpdates here, not generateInserts
    val updates = convertToStringList(dataGen.generateUpdates(100))
    val updateDF = spark.read.json(spark.sparkContext.parallelize(updates.asScala, 2).toDS())

    import org.apache.hudi.DataSourceWriteOptions._
    import org.apache.hudi.config.HoodieWriteConfig._
    updateDF.write
      .mode("append")
      .format("hudi")
      .option("hoodie.insert.shuffle.parallelism", "2")
      .option("hoodie.upsert.shuffle.parallelism", "2")
      .option(PRECOMBINE_FIELD.key(), "ts")
      .option(RECORDKEY_FIELD.key(), "uuid")
      .option(PARTITIONPATH_FIELD.key(), "partitionpath")
      .option(TBL_NAME.key(), tableName)
      .save(tablePath)
  }
  /**
   * Incremental query
   * @param spark
   * @param path
   */
  def incrementQueryData(spark: SparkSession, path: String): Unit = {
    import spark.implicits._

    // a. Load the table and collect its commit times
    spark.read
      .format("hudi")
      .load(path)
      .createOrReplaceTempView("view_temp_hudi_trips")
    val commits: Array[String] = spark
      .sql(
        """
          |select
          |  distinct(_hoodie_commit_time) as commitTime
          |from
          |  view_temp_hudi_trips
          |order by
          |  commitTime DESC
          |""".stripMargin
      )
      .map(row => row.getString(0))
      .take(50)
    val beginTime = commits(commits.length - 1) // commit time we are interested in
    println(s"beginTime = ${beginTime}")

    // b. Use that commit time as the threshold for the incremental query
    val tripsIncrementalDF = spark.read
      .format("hudi")
      // Query type: incremental read
      .option(QUERY_TYPE.key(), QUERY_TYPE_INCREMENTAL_OPT_VAL)
      // Start instant for the incremental read
      .option(BEGIN_INSTANTTIME.key(), beginTime)
      .load(path)

    // c. Register the incremental result as a temp view and query trips with fare > 20
    tripsIncrementalDF.createOrReplaceTempView("hudi_trips_incremental")
    spark
      .sql(
        """
          |select
          |  `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts
          |from
          |  hudi_trips_incremental
          |where
          |  fare > 20.0
          |""".stripMargin
      )
      .show(10, truncate = false)

    // Java-style equivalent from the Hudi docs:
    // spark.read()
    //   .format("org.apache.hudi")
    //   .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY(), DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL())
    //   .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(), <beginInstantTime>)
    //   .option(DataSourceReadOptions.INCR_PATH_GLOB_OPT_KEY(), "/year=2020/month=*/day=*") // Optional, use glob pattern if querying certain partitions
    //   .load(tablePath); // For incremental query, pass in the root/base path of table
    //
    // hudiIncQueryDF.createOrReplaceTempView("hudi_trips_incremental")
    // spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_incremental where fare > 20.0").show()
  }
}
3. Reading the Hudi data on HDFS from Hive (the important part)
I assumed that since the data was already sitting on HDFS, creating an external table over it would be trivial; it turned out to be where I hit the most pitfalls.
First, look at the schema of the data Hudi generated (printed by tripsSnapshotDF.printSchema() in the queryData function):
root
|-- _hoodie_commit_time: string (nullable = true)
|-- _hoodie_commit_seqno: string (nullable = true)
|-- _hoodie_record_key: string (nullable = true)
|-- _hoodie_partition_path: string (nullable = true)
|-- _hoodie_file_name: string (nullable = true)
|-- begin_lat: double (nullable = true)
|-- begin_lon: double (nullable = true)
|-- driver: string (nullable = true)
|-- end_lat: double (nullable = true)
|-- end_lon: double (nullable = true)
|-- fare: double (nullable = true)
|-- rider: string (nullable = true)
|-- ts: long (nullable = true)
|-- uuid: string (nullable = true)
|-- partitionpath: string (nullable = true)
Create an external table in Hive based on this schema (you can create a hudi database first):
CREATE EXTERNAL TABLE `tb1_trips_cow`(
`_hoodie_commit_time` string,
`_hoodie_commit_seqno` string,
`_hoodie_record_key` string,
`_hoodie_partition_path` string,
`_hoodie_file_name` string,
begin_lat double,
begin_lon double,
driver string,
end_lat double,
end_lon double,
fare double,
rider string,
ts bigint,
uuid string,
partitionpath string
)
PARTITIONED BY (area string,county string ,city string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs://nameservice1/hudidatas/hudi-warehouse/tb1_trips_cow';
I hit several pitfalls with this single CREATE TABLE statement:
1. You cannot declare every column as string. I normally build all-string Hive tables, but here that triggers type-mismatch errors; any error mentioning an XXXWritable (long, double) comes from this, for example:
"Cannot inspect org.apache.hadoop.hive.serde2.io.HiveDoubleWritable"
2. PARTITIONED BY partition design. Hudi lays data out in partitions just like Hive; on HDFS there are three directory levels before the data files, and partitionpath certainly looks like a path, so at first I declared PARTITIONED BY (partitionpath string). No matter how I queried or tweaked the table, I got no data and no error: the data was clearly there, yet every query came back empty, which was maddening. After sleeping on it, I remembered Hive's multi-level partitioning: three directory levels need three partition columns. So I defined three fields, PARTITIONED BY (area string, county string, city string), and recreated the table.
3. Partitions must be added manually. After redesigning the partitions, queries still returned nothing. After more digging I found that every partition has to be added by hand (I don't have a better way at the moment; see the sketch after the example below), like this:
ALTER TABLE hudi.tb1_trips_cow ADD IF NOT EXISTS PARTITION (area='asia',county='india' ,city='chennai' ) LOCATION '/hudidatas/hudi-warehouse/tb1_trips_cow/asia/india/chennai';
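Since every partition has to be registered, one workaround is to walk the three directory levels on HDFS and print the matching ALTER TABLE statements, then run them in Hive. A rough sketch under the same area/county/city layout, using the database and table names from above:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object GenAddPartitionSql {
  def main(args: Array[String]): Unit = {
    val tablePath = "/hudidatas/hudi-warehouse/tb1_trips_cow"
    val fs = FileSystem.get(new Configuration())

    // Walk the three partition levels: area / county / city (skip .hoodie metadata)
    for (area <- fs.listStatus(new Path(tablePath))
           if area.isDirectory && !area.getPath.getName.startsWith(".");
         county <- fs.listStatus(area.getPath) if county.isDirectory;
         city <- fs.listStatus(county.getPath) if city.isDirectory) {
      val (a, co, ci) = (area.getPath.getName, county.getPath.getName, city.getPath.getName)
      println(
        s"ALTER TABLE hudi.tb1_trips_cow ADD IF NOT EXISTS PARTITION " +
        s"(area='$a', county='$co', city='$ci') " +
        s"LOCATION '$tablePath/$a/$co/$ci';")
    }
    fs.close()
  }
}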
4. The storage format and input/output formats must be specified. At first I grabbed an INPUTFORMAT off the web at random, 'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat', which is the Realtime variant and not needed here. Use the formats below instead:
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
A simple CREATE TABLE statement, yet it was full of pitfalls, and climbing out of them took a lot of time and energy.
Seeing the data finally come back from a query was a relief; I had climbed out at last.
4. Accessing the Hudi table from spark-sql
Since the Hive external table can be queried, spark-sql should be able to access it too. 1. Copy the hudi-hadoop-mr-bundle-0.9.0.jar into spark/jars on the cluster; without this jar the query fails. Then open spark-shell:
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> sqlContext.sql(" SELECT * FROM hudi.tb1_trips_cow").collect().foreach(println)
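The same query can also be run from the IDEA project instead of spark-shell; a minimal sketch, assuming Hive support is enabled on the session and hive-site.xml is on the classpath (HiveContext works too but is deprecated in newer Spark versions):
import org.apache.spark.sql.SparkSession

object QueryHudiViaHive {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("queryHudiViaHive")
      .config("hive.metastore.uris", "thrift://bd1.bcht:9083,thrift://bd2.bcht:9083")
      .enableHiveSupport()
      .getOrCreate()

    // Reads through the Hive external table created in section 3
    spark.sql("SELECT * FROM hudi.tb1_trips_cow LIMIT 10").show(truncate = false)

    spark.stop()
  }
}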
Summary
From this round of falling into and climbing out of pits: when you have been wrestling with one problem for too long, you get mentally drained and start to feel helpless. Rather than keep grinding, it is better to stop, recover, and come back later; with the earlier lessons in mind, your thinking is usually much clearer.