Hudi Installation
Background
Our team needed a Hudi environment to test Flink CDC. Hudi is a streaming data lake platform that directly bridges databases and data warehouses. It integrates with Hadoop and Hive and supports record-level inserts, updates, deletes, and queries. Hudi ingests data with transactional guarantees and index optimizations, making it a new-generation technology for building real-time data warehouses and unified real-time lakehouses.
Environment
HDFS environment:
JDK: 1.8+
Scala: 2.11
Hudi: 0.10.0
Flink: 1.13.5
Compiling Hudi
Download the source
git clone https://github.com/apache/hudi.git
Version compatibility
0.10.0 (git branch: release-0.10.0) is compatible with Flink 1.13.x. (It is recommended to compile directly from the master branch: master is already at 0.11.0, and some bugs in 0.10 are only fixed there.)
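Assuming the repository was cloned as above, checking out the matching release line looks like this (the branch name comes from the compatibility note; building from master, as suggested, is the alternative):

```shell
cd hudi
# pin the build to the Flink-1.13-compatible release line
git checkout release-0.10.0
# or, as recommended above, build from master instead:
# git checkout master
```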
Configuration
Edit hudi/pom.xml and, in the properties section, change the hadoop and hive versions to match your own environment:
<fasterxml.spark3.version>2.10.0</fasterxml.spark3.version>
<kafka.version>2.0.0</kafka.version>
<confluent.version>5.3.4</confluent.version>
<glassfish.version>2.17</glassfish.version>
<parquet.version>1.10.1</parquet.version>
<junit.jupiter.version>5.7.0-M1</junit.jupiter.version>
<junit.vintage.version>5.7.0-M1</junit.vintage.version>
<junit.platform.version>1.7.0-M1</junit.platform.version>
<mockito.jupiter.version>3.3.3</mockito.jupiter.version>
<log4j.version>1.2.17</log4j.version>
<slf4j.version>1.7.30</slf4j.version>
<joda.version>2.9.9</joda.version>
<hadoop.version>3.1.1</hadoop.version>
<hive.groupid>org.apache.hive</hive.groupid>
<hive.version>3.1.0</hive.version>
<hive.exec.classifier>core</hive.exec.classifier>
<metrics.version>4.1.1</metrics.version>
<orc.version>1.6.0</orc.version>
<airlift.version>0.16</airlift.version>
<prometheus.version>0.8.0</prometheus.version>
<http.version>4.4.1</http.version>
<spark.version>${spark2.version}</spark.version>
<sparkbundle.version>${spark2bundle.version}</sparkbundle.version>
<flink.version>1.13.5</flink.version>
<spark2.version>2.4.4</spark2.version>
<spark3.version>3.1.2</spark3.version>
<spark2bundle.version></spark2bundle.version>
<spark3bundle.version>3</spark3bundle.version>
<hudi.spark.module>hudi-spark2</hudi.spark.module>
<avro.version>1.8.2</avro.version>
<scala11.version>2.11.12</scala11.version>
<scala12.version>2.12.10</scala12.version>
<scala.version>${scala11.version}</scala.version>
<scala.binary.version>2.11</scala.binary.version>
<apache-rat-plugin.version>0.12</apache-rat-plugin.version>
<scala-maven-plugin.version>3.3.1</scala-maven-plugin.version>
<scalatest.version>3.0.1</scalatest.version>
Edit hudi/packaging/hudi-flink-bundle/pom.xml and, in the profiles section, change hive.version in the flink-bundle-shade-hive3 profile to your own Hive version. If you are on Hive 1.x, modify flink-bundle-shade-hive1 instead; for Hive 2.x, modify flink-bundle-shade-hive2.
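As a rough sketch of where that edit lands (the exact profile contents differ between Hudi versions, so treat this as illustrative rather than a verbatim copy of the file):

```xml
<!-- in hudi/packaging/hudi-flink-bundle/pom.xml, under <profiles> -->
<profile>
  <id>flink-bundle-shade-hive3</id>
  <properties>
    <!-- change this to your own Hive 3.x version -->
    <hive.version>3.1.0</hive.version>
  </properties>
  <!-- shaded dependencies omitted -->
</profile>
```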
Run the build
mvn clean install -DskipTests -Pflink-bundle-shade-hive3
If you prefer not to modify the pom, you can pass the versions as properties on the mvn command line instead:
hudi> mvn clean install -DskipTests -Dscala-2.11 -Dhadoop.version=3.1.4 -Dhive.version=3.1.2 -Pflink-bundle-shade-hive3 -Pspark3.1.x
The first build takes about an hour; once it has succeeded, subsequent builds take around 15 minutes.
Note: adding -Pspark2, -Pspark3.1.x, or -Pspark3 builds the corresponding hudi-spark-bundle version.
Build errors
[WARNING] warning: While parsing annotations in C:\Users\zhang\.m2\repository\org\apache\spark\spark-core_2.11\2.4.4\spark-core_2.11-2.4.4.jar(org/apache/spark/rdd/RDDOperationScope.class), could not find NON_NULL in enum <none>.
[INFO] This is likely due to an implementation restriction: an annotation argument cannot refer to a member of the annotated class (SI-7014).
[ERROR] D:\IdeaProject\hudi\hudi-integ-test\src\main\scala\org\apache\hudi\integ\testsuite\utils\SparkSqlUtils.scala:518: error: Symbol 'term com.fasterxml.jackson.annotation' is missing from the classpath.
[ERROR] This symbol is required by '<none>'.
[ERROR] Make sure that term annotation is in your classpath and check for conflicting dependencies with `-Ylog-classpath`.
[ERROR] A full rebuild may help if 'RDDOperationScope.class' was compiled against an incompatible version of com.fasterxml.jackson.
[ERROR] .map(record => {
[ERROR]             ^
[WARNING] one warning found
[ERROR] one error found
Since hudi-integ-test is a dedicated integration-test module that is not needed for real-world use, you can simply skip compiling it. Edit hudi/pom.xml and comment out the test modules in the modules list:
<module>hudi-timeline-service</module>
<module>hudi-utilities</module>
<module>hudi-sync</module>
<module>packaging/hudi-hadoop-mr-bundle</module>
<module>packaging/hudi-hive-sync-bundle</module>
<module>packaging/hudi-spark-bundle</module>
<module>packaging/hudi-presto-bundle</module>
<module>packaging/hudi-utilities-bundle</module>
<module>packaging/hudi-timeline-server-bundle</module>
<module>docker/hoodie/hadoop</module>
<!--<module>hudi-integ-test</module>
<module>packaging/hudi-integ-test-bundle</module>-->
<module>hudi-examples</module>
<module>hudi-flink</module>
<module>hudi-kafka-connect</module>
<module>packaging/hudi-flink-bundle</module>
<module>packaging/hudi-kafka-connect-bundle</module>
</modules>
Build artifacts
Two files are the main ones you will use:
1. hudi-flink-bundle_2.11-0.10.0.jar, used by Flink to read and write Hudi data; located at hudi/packaging/hudi-flink-bundle/target/hudi-flink-bundle_2.11-0.10.0.jar.
2. hudi-hadoop-mr-bundle-0.10.0.jar, used by Hive to read Hudi data; located at hudi/packaging/hudi-hadoop-mr-bundle/target/hudi-hadoop-mr-bundle-0.10.0.jar.
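As a minimal deployment sketch, the two jars are typically dropped onto Flink's and Hive's classpaths. The $FLINK_HOME and $HIVE_HOME paths here are assumptions about your installation, not something the build produces, and auxlib is just one common place for Hive auxiliary jars:

```shell
# make the Flink bundle visible to Flink SQL / Table jobs
cp packaging/hudi-flink-bundle/target/hudi-flink-bundle_2.11-0.10.0.jar "$FLINK_HOME/lib/"

# make the MR bundle visible to Hive (auxlib is one common convention)
mkdir -p "$HIVE_HOME/auxlib"
cp packaging/hudi-hadoop-mr-bundle/target/hudi-hadoop-mr-bundle-0.10.0.jar "$HIVE_HOME/auxlib/"

# restart the Flink cluster and HiveServer2 afterwards so the jars are picked up
```

Registering the MR bundle via hive.aux.jars.path is an alternative to auxlib; check your Hive distribution's documentation for the preferred location.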