Detailed Tutorial: Working with Hudi Tables from Spark


    createOrReplaceTempView("hudi_mor_tbl_shell")

    // Collect up to 50 distinct commit times, newest first; the last element is the earliest of them.
    val commits = spark.sql("select distinct(_hoodie_commit_time) as commitTime from hudi_mor_tbl_shell order by commitTime desc").map(k => k.getString(0)).take(50)
    val beginTime = commits(commits.length - 1)

    // Incremental query: read only the records committed after beginTime.
    val idf = spark.read.format("hudi").
      option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
      option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
      load("hdfs:///hudi/hudi_mor_tbl_shell")
    idf.createOrReplaceTempView("hudi_mor_tbl_shell_incremental")

    spark.sql("select _hoodie_commit_time, id, name, price, ts from hudi_mor_tbl_shell_incremental").show()

The result contains only the recently inserted or modified records, that is, those committed after `beginTime`.
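The filtering that the incremental query performs can be pictured with a small plain-Scala model. This is only a sketch for intuition, not Hudi's implementation: `Rec`, `beginTimeOf`, and `incremental` are hypothetical names, and it assumes commit times are fixed-width digit strings so that string order equals time order.

```scala
object IncrementalQuerySketch {
  // A row reduced to the columns selected in the tutorial's query (hypothetical type).
  final case class Rec(commitTime: String, id: Int, name: String, price: Double, ts: Long)

  // Mirrors the tutorial: distinct commit times, newest first, at most 50, then the last one.
  def beginTimeOf(rows: Seq[Rec]): String = {
    val commits = rows.map(_.commitTime).distinct.sorted.reverse.take(50)
    commits(commits.length - 1)
  }

  // Keep only rows committed strictly after beginTime.
  // Assumption: commit times are fixed-width digit strings, so string order is time order.
  def incremental(rows: Seq[Rec], beginTime: String): Seq[Rec] =
    rows.filter(_.commitTime.compareTo(beginTime) > 0)
}
```

With three rows written by three different commits, `beginTimeOf` picks the earliest commit and `incremental` returns the rows from the two later commits.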



### Update data

    import org.apache.spark.sql._
    import org.apache.spark.sql.types._

    val fields = Array(
      StructField("id", IntegerType, true),
      StructField("name", StringType, true),
      StructField("price", DoubleType, true),
      StructField("ts", LongType, true)
    )
    val simpleSchema = StructType(fields)
    // New values for the record whose key is id = 2.
    val data = Seq(Row(2, "a2", 400.0, 2222L))
    // createDataFrame with an explicit schema expects an RDD[Row] (or a java.util.List[Row]), not a Seq[Row].
    val df = spark.createDataFrame(spark.sparkContext.parallelize(data), simpleSchema)

    df.write.format("hudi").
      option(PRECOMBINE_FIELD_OPT_KEY, "ts").
      option(RECORDKEY_FIELD_OPT_KEY, "id").
      option(TABLE_NAME, "hudi_mor_tbl_shell").
      option(TABLE_TYPE_OPT_KEY, "MERGE_ON_READ").
      mode(Append).
      save("hdfs:///hudi/hudi_mor_tbl_shell")

To verify, run a regular query.
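The roles of the record key (`id`) and the precombine field (`ts`) in such a write can be illustrated with a plain-Scala model. It is a simplified sketch under the assumption that the default write payload is in use, not Hudi's actual code: rows in one incoming batch that share a key are reduced to the one with the largest `ts`, and the surviving row replaces the stored row with the same key. `Rec`, `dedupe`, and `upsert` are hypothetical names, and the stored values in the usage note are made up.

```scala
object UpsertSketch {
  // Hypothetical row type with the tutorial's four columns.
  final case class Rec(id: Int, name: String, price: Double, ts: Long)

  // Within one batch, keep the row with the largest precombine value (ts) per record key (id).
  def dedupe(batch: Seq[Rec]): Seq[Rec] =
    batch.groupBy(_.id).values.map(_.maxBy(_.ts)).toSeq.sortBy(_.id)

  // Rows from the batch replace stored rows with the same key; all other stored rows are kept.
  def upsert(stored: Seq[Rec], batch: Seq[Rec]): Seq[Rec] = {
    val incoming = dedupe(batch)
    val keys = incoming.map(_.id).toSet
    (stored.filterNot(r => keys.contains(r.id)) ++ incoming).sortBy(_.id)
  }
}
```

For example, upserting `Rec(2, "a2", 400.0, 2222L)` into a table that holds rows with ids 1 and 2 leaves row 1 untouched and replaces row 2.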



### Insert overwrite

    import org.apache.spark.sql._
    import org.apache.spark.sql.types._

    val fields = Array(
      StructField("id", IntegerType, true),
      StructField("name", StringType, true),
      StructField("price", DoubleType, true),
      StructField("ts", LongType, true)
    )
    val simpleSchema = StructType(fields)
    val data = Seq(Row(99, "a99", 20.0, 900L))
    // createDataFrame with an explicit schema expects an RDD[Row] (or a java.util.List[Row]), not a Seq[Row].
    val df = spark.createDataFrame(spark.sparkContext.parallelize(data), simpleSchema)

    df.write.format("hudi").
      option(OPERATION.key(), "insert_overwrite").
      option(PRECOMBINE_FIELD.key(), "ts").
      option(RECORDKEY_FIELD.key(), "id").
      option(TBL_NAME.key(), "hudi_mor_tbl_shell").
      option(TABLE_TYPE_OPT_KEY, "MERGE_ON_READ").
      mode(Append).
      save("hdfs:///hudi/hudi_mor_tbl_shell")

To verify, run a regular query. Only the newly inserted record remains.
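Why only one row is left can be seen from a toy model of `insert_overwrite`: the operation rewrites every partition that appears in the incoming data, and a non-partitioned table behaves like a single partition. The sketch below is an illustration only, not Hudi's implementation; `Rec`, its `partition` field, and `insertOverwrite` are hypothetical names.

```scala
object InsertOverwriteSketch {
  // partition is a hypothetical partition path; a non-partitioned table uses "" for every row.
  final case class Rec(partition: String, id: Int, name: String, price: Double, ts: Long)

  // Every partition that appears in the batch is replaced by the batch's rows for it;
  // partitions that the batch does not mention are left alone.
  def insertOverwrite(stored: Seq[Rec], batch: Seq[Rec]): Seq[Rec] = {
    val touched = batch.map(_.partition).toSet
    stored.filterNot(r => touched.contains(r.partition)) ++ batch
  }
}
```

Applied to a non-partitioned table, any non-empty batch touches the only partition, so the result is exactly the batch.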



### Delete data

    import org.apache.spark.sql._
    import org.apache.spark.sql.types._

    val fields = Array(
      StructField("id", IntegerType, true),
      StructField("name", StringType, true),
      StructField("price", DoubleType, true),
      StructField("ts", LongType, true)
    )
    val simpleSchema = StructType(fields)
    // The record to delete, identified by its key (id = 2).
    val data = Seq(Row(2, "a2", 400.0, 2222L))
    // createDataFrame with an explicit schema expects an RDD[Row] (or a java.util.List[Row]), not a Seq[Row].
    val df = spark.createDataFrame(spark.sparkContext.parallelize(data), simpleSchema)

    df.write.format("hudi").
      option(OPERATION_OPT_KEY, "delete").
      option(PRECOMBINE_FIELD_OPT_KEY, "ts").
      option(RECORDKEY_FIELD_OPT_KEY, "id").
      option(TABLE_NAME, "hudi_mor_tbl_shell").
      mode(Append).
      save("hdfs:///hudi/hudi_mor_tbl_shell")

To verify, run a regular query.
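The delete operation identifies rows by their record key, so the non-key columns of the DataFrame passed to it do not have to match the stored values. A minimal plain-Scala model of that behaviour, with hypothetical names `Rec` and `delete` and made-up stored values, might look like this:

```scala
object DeleteSketch {
  // Hypothetical row type with the tutorial's four columns.
  final case class Rec(id: Int, name: String, price: Double, ts: Long)

  // Only the record key (id) of each row in the delete batch is used; other columns are not compared.
  def delete(stored: Seq[Rec], batch: Seq[Rec]): Seq[Rec] = {
    val keys = batch.map(_.id).toSet
    stored.filterNot(r => keys.contains(r.id))
  }
}
```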



### Using Spark SQL

Start spark-sql with Hudi support as follows:

    ./spark-sql \
      --master yarn \
      --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
      --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
      --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'


If you are using Hudi 0.11.x, run the following instead:

    ./spark-sql \
      --master yarn \
      --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
      --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'


Create a table:

    create table hudi_mor_tbl (
      id int,
      name string,
      price double,
      ts bigint
    ) using hudi
    tblproperties (
      type = 'mor',
      primaryKey = 'id',
      preCombineField = 'ts'
    )
    location 'hdfs:///hudi/hudi_mor_tbl';

Verify:

    show tables;


### Insert data

Using SQL:

    insert into hudi_mor_tbl select 1, 'a1', 20, 1000;

Verify:

    select * from hudi_mor_tbl;


### Regular query

Using SQL:

    select * from hudi_mor_tbl;


### Update data

Using SQL:

    update hudi_mor_tbl set price = price * 2, ts = 1111 where id = 1;

Verify:

    select * from hudi_mor_tbl;


### Insert overwrite

Using SQL:

    insert overwrite hudi_mor_tbl select 99, 'a99', 20.0, 900;

Verify:

    select * from hudi_mor_tbl;

Only the newly inserted row remains.



### Delete data

Using SQL:

    delete from hudi_mor_tbl where id % 2 = 1;

Verify:

    select * from hudi_mor_tbl;


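### Kerberos and permission configuration

For example, to let the hudi user operate on Hudi tables, with jobs submitted to the default queue and the table stored at hdfs:///hudi/t1, configure Ranger as follows:

1. Create a user named hudi in Ranger.

2. Grant the hudi user read and write permissions on the following directories: /hudi/t1, /tmp, /user/hudi.

3. Grant the hudi user permission on the YARN default queue.

If Kerberos is enabled, the following additional steps are required:

1. Create the principal hudi@PAULTECH.COM in Kerberos and generate its keytab file.

2. After running kinit, make sure the hudi user has the permissions needed to perform the operations above.

With these settings, the hudi user should be able to operate on the given table path and hold the HDFS and YARN permissions the application needs to run.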


### FAQ

**1. spark-sql or spark-shell fails at startup with NoClassDefFoundError: org/apache/hadoop/shaded/javax/ws/rs/core/NoContentException**

Error log:

    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/shaded/javax/ws/rs/core/NoContentException
        at org.apache.hadoop.yarn.util.timeline.TimelineUtils.<clinit>(TimelineUtils.java:60)
        at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:200)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
        at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:191)
        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
        at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:222)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:585)
        at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2704)
        at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:947)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:54)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.<init>(SparkSQLCLIDriver.scala:327)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:159)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.shaded.javax.ws.rs.core.NoContentException
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 27 more


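Cause: the Hadoop and Spark versions do not match.  
 Solution: disable the YARN timeline service. See the environment configuration section for how to disable it.  
 Reference:  
 <https://github.com/apache/kyuubi/issues/2904>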


**2. Creating a table fails with CreateHoodieTableCommand: Failed to create catalog table in metastore: org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat**

The original error message does not show what went wrong, so extra logging has to be added to:

hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/command/CreateHoodieTableCommand.scala

Change the code around line 85 to:

    case NonFatal(e) => {
      logWarning(s"Failed to create catalog table in metastore: ${e.getMessage}")
      logWarning(s"Failed to create catalog table in metastore: ${e.getClass}")
      logWarning(s"Failed to create catalog table in metastore: ${e.getStackTrace.mkString("Array(", ", ", ")")}")
    }

Rebuild, replace the jar, and run again. The more detailed error log now shows:  
 org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat. A search shows that this class is in the hudi-hadoop-mr-bundle package.  
 **Put the hudi-hadoop-mr-bundle-0.13.1.jar produced by the Hudi build into the lib or auxlib directory of the Hive installation. After the Hive metastore service is restarted, table creation works again.**


**3. The spark-sql and spark-shell commands are long because the conf settings Hudi requires must be added every time. Can they be shortened?**  
 Yes. Add the Hudi settings to the spark-defaults.conf file. For example, for Hudi 0.13.1 add the following to spark-defaults.conf:

    spark.serializer=org.apache.spark.serializer.KryoSerializer
    spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog
    spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension

After this change, starting spark-shell only requires:

    ./spark-shell --master yarn

For spark-sql, run:

    ./spark-sql --master yarn
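A quick way to confirm that a spark-defaults.conf file really carries the three Hudi settings is to parse it as a Java properties file and compare the values. The helper below is a hypothetical sketch (`HudiDefaultsCheck` and `missing` are made-up names) that takes the file content as a string; it assumes the Hudi 0.13.1 settings from this FAQ entry are the ones wanted.

```scala
import java.io.StringReader
import java.util.Properties

object HudiDefaultsCheck {
  // The three settings listed in this FAQ entry for Hudi 0.13.1.
  val required: Map[String, String] = Map(
    "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer",
    "spark.sql.catalog.spark_catalog" -> "org.apache.spark.sql.hudi.catalog.HoodieCatalog",
    "spark.sql.extensions" -> "org.apache.spark.sql.hudi.HoodieSparkSessionExtension"
  )

  // Returns the required keys that are absent from confText or set to a different value.
  // java.util.Properties accepts both "key=value" and "key value" lines.
  def missing(confText: String): Set[String] = {
    val props = new Properties()
    props.load(new StringReader(confText))
    required.collect { case (k, v) if props.getProperty(k) != v => k }.toSet
  }
}
```

An empty result from `HudiDefaultsCheck.missing(text)` means all three settings are present with the expected values.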

