Hudi Installation




Background

The company needed a Hudi environment to test Flink CDC. Hudi is a streaming data lake platform that directly bridges databases and data warehouses. It integrates with Hadoop and Hive and supports record-level inserts, updates, and deletes. With streaming ingestion, transactional guarantees, and index optimizations, Hudi is a new-generation technology for building real-time data warehouses and unified lakehouse architectures.

Environment

HDFS: a running HDFS cluster (cluster screenshot omitted)
JDK: 1.8+
Scala: 2.11
Hudi: 0.10.0
Flink: 1.13.5
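Before building, it is worth sanity-checking that the toolchain matches the versions above. A minimal sketch, assuming the tools are already on your PATH:

```shell
# Verify the build prerequisites (versions per this guide).
java -version      # expect 1.8+
scala -version     # expect 2.11.x
mvn -version       # Maven is required for the build
hadoop version     # confirms the Hadoop client install
```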

Building Hudi

Download the source

git clone https://github.com/apache/hudi.git

Version compatibility

Hudi 0.10.0 (git branch: release-0.10.0) pairs with Flink 1.13.x. Building from the master branch is recommended: master is already at 0.11.0, and it fixes some bugs that are still present in 0.10.
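To build a specific release rather than master, check out the matching branch after cloning (branch name taken from the note above):

```shell
cd hudi
# release-0.10.0 pairs with Flink 1.13.x; stay on master for the latest fixes
git checkout release-0.10.0
```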

Configuration

Edit hudi/pom.xml and, in the properties section, change the hadoop and hive versions to match your environment:

<fasterxml.spark3.version>2.10.0</fasterxml.spark3.version>
    <kafka.version>2.0.0</kafka.version>
    <confluent.version>5.3.4</confluent.version>
    <glassfish.version>2.17</glassfish.version>
    <parquet.version>1.10.1</parquet.version>
    <junit.jupiter.version>5.7.0-M1</junit.jupiter.version>
    <junit.vintage.version>5.7.0-M1</junit.vintage.version>
    <junit.platform.version>1.7.0-M1</junit.platform.version>
    <mockito.jupiter.version>3.3.3</mockito.jupiter.version>
    <log4j.version>1.2.17</log4j.version>
    <slf4j.version>1.7.30</slf4j.version>
    <joda.version>2.9.9</joda.version>
    <hadoop.version>3.1.1</hadoop.version>
    <hive.groupid>org.apache.hive</hive.groupid>
    <hive.version>3.1.0</hive.version>
    <hive.exec.classifier>core</hive.exec.classifier>
    <metrics.version>4.1.1</metrics.version>
    <orc.version>1.6.0</orc.version>
    <airlift.version>0.16</airlift.version>
    <prometheus.version>0.8.0</prometheus.version>
    <http.version>4.4.1</http.version>
    <spark.version>${spark2.version}</spark.version>
    <sparkbundle.version>${spark2bundle.version}</sparkbundle.version>
    <flink.version>1.13.5</flink.version>
    <spark2.version>2.4.4</spark2.version>
    <spark3.version>3.1.2</spark3.version>
    <spark2bundle.version></spark2bundle.version>
     <spark3bundle.version>3</spark3bundle.version>
    <hudi.spark.module>hudi-spark2</hudi.spark.module>
    <avro.version>1.8.2</avro.version>
    <scala11.version>2.11.12</scala11.version>
     <scala12.version>2.12.10</scala12.version>
    <scala.version>${scala11.version}</scala.version>
    <scala.binary.version>2.11</scala.binary.version>
    <apache-rat-plugin.version>0.12</apache-rat-plugin.version>
    <scala-maven-plugin.version>3.3.1</scala-maven-plugin.version>
    <scalatest.version>3.0.1</scalatest.version>

Then edit hudi/packaging/hudi-flink-bundle/pom.xml and, in the profiles section, change hive.version in the flink-bundle-shade-hive3 profile to your Hive version. For Hive 1.x, edit flink-bundle-shade-hive1 instead; for Hive 2.x, flink-bundle-shade-hive2.
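For orientation, the relevant part of the profile looks roughly like this (an illustrative sketch: only hive.version needs changing, and the profile's other content is elided):

```xml
<!-- packaging/hudi-flink-bundle/pom.xml (sketch, not the full profile) -->
<profile>
  <id>flink-bundle-shade-hive3</id>
  <properties>
    <hive.version>3.1.0</hive.version> <!-- set to your Hive version -->
  </properties>
  <!-- shading/dependency overrides elided -->
</profile>
```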

Run the build

mvn clean install -DskipTests -Pflink-bundle-shade-hive3

If you prefer not to edit the pom, you can pass the versions as build properties on the command line instead:

hudi> mvn clean install -DskipTests -Dscala-2.11 -Dhadoop.version=3.1.4 -Dhive.version=3.1.2 -Pflink-bundle-shade-hive3 -Pspark3.1.x

The first build takes about an hour; subsequent builds take around 15 minutes.

Note: add -Pspark2, -Pspark3.1.x, or -Pspark3 at build time to produce the hudi-spark-bundle for the corresponding Spark version.

Build errors

[WARNING] warning: While parsing annotations in C:\Users\zhang\.m2\repository\org\apache\spark\spark-core_2.11\2.4.4\spark-core_2.11-2.4.4.jar(org/apache/spark/rdd/RDDOperationScope.class), could not find NON_NULL in enum <none>.
[INFO] This is likely due to an implementation restriction: an annotation argument cannot refer to a member of the annotated class (SI-7014).
[ERROR] D:\IdeaProject\hudi\hudi-integ-test\src\main\scala\org\apache\hudi\integ\testsuite\utils\SparkSqlUtils.scala:518: error: Symbol 'term com.fasterxml.jackson.annotation' is missing from the classpath.
[ERROR] This symbol is required by '<none>'.
[ERROR] Make sure that term annotation is in your classpath and check for conflicting dependencies with `-Ylog-classpath`.
[ERROR] A full rebuild may help if 'RDDOperationScope.class' was compiled against an incompatible version of com.fasterxml.jackson.
[ERROR]       .map(record => {
[ERROR]                   ^
[WARNING] one warning found
[ERROR] one error found

Since hudi-integ-test is a dedicated testing module and not needed for real-world use, you can simply skip it. Edit hudi/pom.xml and comment out the test modules in the modules section:

<module>hudi-timeline-service</module>
    <module>hudi-utilities</module>
    <module>hudi-sync</module>
    <module>packaging/hudi-hadoop-mr-bundle</module>
    <module>packaging/hudi-hive-sync-bundle</module>
    <module>packaging/hudi-spark-bundle</module>
    <module>packaging/hudi-presto-bundle</module>
    <module>packaging/hudi-utilities-bundle</module>
    <module>packaging/hudi-timeline-server-bundle</module>
    <module>docker/hoodie/hadoop</module>
    <!--<module>hudi-integ-test</module>
    <module>packaging/hudi-integ-test-bundle</module>-->
    <module>hudi-examples</module>
    <module>hudi-flink</module>
    <module>hudi-kafka-connect</module>
    <module>packaging/hudi-flink-bundle</module>
    <module>packaging/hudi-kafka-connect-bundle</module>
  </modules>
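Alternatively, with Maven 3.2.1+ the two modules can be excluded on the command line without touching the pom (the `!` prefix excludes a module by its path):

```shell
mvn clean install -DskipTests -Pflink-bundle-shade-hive3 \
    -pl '!hudi-integ-test,!packaging/hudi-integ-test-bundle'
```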


Build artifacts

The build produces two files we care about:

1. hudi-flink-bundle_2.11-0.10.0.jar, used by Flink to read and write Hudi data; located at hudi/packaging/hudi-flink-bundle/target/hudi-flink-bundle_2.11-0.10.0.jar.

2. hudi-hadoop-mr-bundle-0.10.0.jar, used by Hive to read Hudi data; located at hudi/packaging/hudi-hadoop-mr-bundle/target/hudi-hadoop-mr-bundle-0.10.0.jar.
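To put the two bundles to use, they typically go into Flink's lib directory and Hive's auxlib directory respectively. A sketch, where FLINK_HOME and HIVE_HOME are placeholders for your installation paths:

```shell
# Paths are assumptions -- adjust to your installation.
cp hudi/packaging/hudi-flink-bundle/target/hudi-flink-bundle_2.11-0.10.0.jar \
   "$FLINK_HOME/lib/"
mkdir -p "$HIVE_HOME/auxlib"
cp hudi/packaging/hudi-hadoop-mr-bundle/target/hudi-hadoop-mr-bundle-0.10.0.jar \
   "$HIVE_HOME/auxlib/"
```

Restart the Flink cluster and HiveServer2 afterwards so the new jars are picked up.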