Hudi-0.9 Flink-1.12 Quick-Start Guide

  • Most of this content comes from the official guide; the additions are there to help you run through the guide successfully.

Official documentation:   Guide

Introduction

This guide provides a quick peek at Hudi's capabilities using the Flink SQL client. Using Flink SQL, we will walk through code snippets that allow you to insert into and update a Hudi table of the default table types: Copy on Write and Merge On Read. After each write operation we will also show how to read the data snapshot (incremental read is already on the roadmap).

Installation

We use the Flink SQL Client because it's a good quick-start tool for SQL users.

Step 1: Download the Flink jar

Hudi works with Flink 1.11.x versions. You can follow the instructions here for setting up Flink. The hudi-flink-bundle jar is built with Scala 2.11, so it's recommended to use Flink 1.11 bundled with Scala 2.11.

Step 2: Start a Flink cluster

Start a standalone Flink cluster within your Hadoop environment. Before you start up the cluster, we suggest configuring it as follows:

  • in $FLINK_HOME/conf/flink-conf.yaml, add config option taskmanager.numberOfTaskSlots: 4
  • in $FLINK_HOME/conf/workers, add item localhost as 4 lines so that there are 4 workers on the local cluster
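The two configuration edits above can also be scripted. A minimal sketch, assuming a standard Flink directory layout; the `configure_flink` helper name is mine, not part of the official guide:

```shell
# configure_flink: apply the two suggested edits under a given Flink home.
# (Illustrative helper; adjust the path to your own install.)
configure_flink() {
  flink_home="$1"
  # 4 slots per TaskManager
  echo "taskmanager.numberOfTaskSlots: 4" >> "$flink_home/conf/flink-conf.yaml"
  # 4 local workers: one "localhost" line each
  : > "$flink_home/conf/workers"
  for _ in 1 2 3 4; do
    echo "localhost" >> "$flink_home/conf/workers"
  done
}

# Usage: configure_flink /opt/flink-1.12.2
```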

Now start the cluster:

# HADOOP_HOME is your hadoop root directory after unpacking the binary package.
export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`

# Start the flink standalone cluster
./bin/start-cluster.sh

Step 3: Start the Flink SQL client

Hudi has a prepared bundle jar for Flink, which should be loaded into the Flink SQL Client when it starts up. You can build the jar manually under the path hudi-source-dir/packaging/hudi-flink-bundle, or download it from the Apache Official Repository.

Now start the SQL CLI:

# HADOOP_HOME is your hadoop root directory after unpacking the binary package.
export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`

./bin/sql-client.sh embedded -j .../hudi-flink-bundle_2.1?-*.*.*.jar shell

Please note the following:

  • We suggest Hadoop 2.9.x+ because some object stores only have filesystem implementations from that version onward
  • The flink-parquet and flink-avro formats are already packaged into the hudi-flink-bundle jar

Set up the table name and base path, and operate using SQL for this guide. The SQL CLI only executes SQL line by line.

Choosing your own version combination

According to the official docs, Hudi's current 0.7 release supports Flink 1.11.

However, I want to use Flink 1.12 with Scala 2.11 together with Hudi, so here we download Flink 1.12.2 built for Scala 2.11:

cd /opt/
wget https://mirrors.tuna.tsinghua.edu.cn/apache/flink/flink-1.12.2/flink-1.12.2-bin-scala_2.11.tgz 
tar xf flink-1.12.2-bin-scala_2.11.tgz

However, Hudi's current release, 0.7.0, does not support Flink 1.12, so we need Hudi 0.9.0. Since it has not been released yet, we have to build and install it ourselves:

git clone https://github.com/apache/hudi.git 
cd hudi
mvn clean package -DskipTests
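If the build succeeds, the Flink bundle jar lands under packaging/hudi-flink-bundle/target. A small sketch to locate it; the `locate_bundle` helper name is mine, and the exact Scala/version suffix depends on your checkout:

```shell
# locate_bundle: print the built hudi-flink-bundle jar under a Hudi source tree.
# (Illustrative helper; the 2.11-* suffix depends on your build.)
locate_bundle() {
  ls "$1"/packaging/hudi-flink-bundle/target/hudi-flink-bundle_2.11-*.jar 2>/dev/null
}

# Usage: locate_bundle /opt/hudi
```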

Copy the built jar into the Flink directory:

cp packaging/hudi-flink-bundle/target/hudi-flink-bundle_2.11-0.9.0-SNAPSHOT.jar /opt/flink-1.12.2/

For installing Maven and configuring the Aliyun mirror, please look it up yourself; it is not covered in this article.

Configuration

Configuring Flink

To integrate the Hadoop configuration and lib jars into Flink 1.12, the official documentation recommends using HADOOP_CLASSPATH:

# single quotes keep the backticks unexpanded, so the classpath is recomputed at each login
echo 'export HADOOP_CLASSPATH=`hadoop classpath`' >> /etc/profile
source /etc/profile
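A note on the quoting in the line above: inside double quotes, the backticks are expanded once at write time, freezing the classpath; inside single quotes, they are stored verbatim and only run later, when the profile is sourced. A generic-shell demonstration of the difference (the paths and variable names are illustrative):

```shell
# With double quotes, the command substitution runs immediately:
# the expanded text is what gets written.
frozen="export HADOOP_CLASSPATH=$(echo /classpath/at/write/time)"

# With single quotes, the backticks are written verbatim and only
# run later, when the profile line is sourced.
deferred='export HADOOP_CLASSPATH=`hadoop classpath`'

echo "$frozen"
echo "$deferred"
```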

Configuring the number of TaskManagers

  • vim ./conf/flink-conf.yaml, add config option taskmanager.numberOfTaskSlots: 4
  • vim ./conf/workers, add item localhost as 4 lines so that there are 4 workers on the local cluster

Startup: Option 1

Start a Flink standalone cluster:

cd /opt/flink-1.12.2
./bin/start-cluster.sh
#check whether the 5 processes (1 JobManager + 4 TaskManagers) have started
jps

#start the SQL client
./bin/sql-client.sh embedded -j ./hudi-flink-bundle_2.11-0.9.0-SNAPSHOT.jar shell

Startup: Option 2

Instead of starting a standalone Flink cluster, use YARN mode:

cd /opt/flink-1.12.2
#start a Flink session cluster on YARN
./bin/yarn-session.sh -s 4 -jm 2048 -tm 4096 -nm flink1.12.2-hudi -d 
#start sql-client.sh to run jobs in the yarn-session
./bin/sql-client.sh embedded -s yarn-session -j ./hudi-flink-bundle_2.11-0.9.0-SNAPSHOT.jar shell
#run jobs in non-yarn-session mode
./bin/sql-client.sh embedded -s default -j ./hudi-flink-bundle_2.11-0.9.0-SNAPSHOT.jar shell

Insert data

For the path scheme we use hdfs. Our CDH cluster's HDFS port is 8020; if you run a single-node Hadoop cluster, the HDFS port is 9000.
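If you are unsure which port your NameNode uses, the fs.defaultFS property in core-site.xml is authoritative. A grep/sed sketch to read it; the `default_fs` helper name is mine, and the config file path varies by distribution:

```shell
# default_fs: extract fs.defaultFS from a core-site.xml, to confirm the
# hdfs://host:port prefix to use in the Hudi 'path' option.
# (Illustrative helper; assumes the <value> sits on the line after <name>.)
default_fs() {
  grep -A1 '<name>fs.defaultFS</name>' "$1" \
    | sed -n 's/.*<value>\(.*\)<\/value>.*/\1/p'
}

# Usage: default_fs /etc/hadoop/conf/core-site.xml   # e.g. hdfs://hadoop1:8020
```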

-- set up the result mode to tableau to show the results directly in the CLI
set execution.result-mode=tableau;

CREATE TABLE t1(
  uuid VARCHAR(20),
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://hadoop1:8020/hudi/t1',
  'table.type' = 'MERGE_ON_READ'
);

-- insert data using values
INSERT INTO t1 VALUES
('id1','Danny',23,TIMESTAMP '1970-01-01 00:00:01','par1'),
('id2','Stephen',33,TIMESTAMP '1970-01-01 00:00:02','par1'),
('id3','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'),
('id4','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'),
('id5','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'),
('id6','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3'),
('id7','Bob',44,TIMESTAMP '1970-01-01 00:00:07','par4'),
('id8','Han',56,TIMESTAMP '1970-01-01 00:00:08','par4');

Update data

-- query the table data
select * from t1;

-- update the data: this would update the record with key 'id1'
insert into t1 values ('id1','Danny',27,TIMESTAMP '1970-01-01 00:00:01','par1');

  • Notice that the save mode is now Append. In general, always use append mode unless you are trying to create the table for the first time. Querying the data again will now show the updated records. Each write operation generates a new commit, denoted by its timestamp. Look for changes in the _hoodie_commit_time and age fields for the same _hoodie_record_keys compared with the previous commit.

Streaming query

Hudi's Flink integration also provides the capability to obtain a stream of records that changed since a given commit timestamp. This can be achieved using Hudi's streaming query, providing a start time from which changes need to be streamed. We do not need to specify an endTime if we want all changes after the given commit (as is the common case).

CREATE TABLE s_t1(
  uuid VARCHAR(20),
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://hadoop1:8020/hudi/t1',
  'table.type' = 'MERGE_ON_READ',
  'read.streaming.enabled' = 'true',
  'read.streaming.start-commit' = '20210402145422',
  'read.streaming.check-interval' = '4'
);

You cannot create the t1 table again in the current sql-client session; it would report a table-already-exists error. So here we create s_t1, reading the same data and path as t1.
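The 'read.streaming.start-commit' value follows Hudi's commit-timestamp pattern, yyyyMMddHHmmss. To stream everything from a given wall-clock time, you can format that time accordingly (GNU date shown; the timestamp itself is just an example):

```shell
# Format a wall-clock time as a Hudi commit timestamp (yyyyMMddHHmmss).
date -d '2021-04-02 14:54:22' +%Y%m%d%H%M%S   # -> 20210402145422
```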

Insert a new row and observe the data stream:

insert into t1 values ('id9','test',27,TIMESTAMP '1970-01-01 00:00:01','par5');

select * from s_t1;

The streaming query can be interrupted with Ctrl+C.

Visit the Flink web UI at hadoop1:8081 to check the status of the related jobs.

The state of the data in HDFS:

The content of the data in HDFS: