@[TOC]
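Series index
Flink from Beginner to Practice (1): Getting Started with Flink, Deploying Flink
Flink from Beginner to Practice (2): Flink DataStream API
Flink from Beginner to Practice (3): Real-Time Data Capture - Flink MySQL CDC
I. Overview
1. Version matching
Pay attention to the MySQL version: this walkthrough uses MySQL 8.0.
Flink CDC also supports many other databases.
The Flink and Flink CDC versions must match as well: here we use Flink 1.18 with Flink CDC 3.0.
2. Adding dependencies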
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-clients</artifactId>
    <version>1.18.0</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java</artifactId>
    <version>1.18.0</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-base</artifactId>
    <version>1.18.0</version>
</dependency>
<dependency>
    <groupId>com.ververica</groupId>
    <artifactId>flink-connector-mysql-cdc</artifactId>
    <version>3.0.0</version>
</dependency>
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>8.0.27</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-runtime</artifactId>
    <version>1.18.0</version>
</dependency>
II. Implementation
1. Basic usage
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;
import com.ververica.cdc.connectors.mysql.source.MySqlSource;

/**
 * Flink MySQL CDC
 * On every fresh start, all existing data is captured once (snapshot),
 * after which changes are read from the binlog.
 */
public class FlinkCDC01 {
    public static void main(String[] args) throws Exception {
        MySqlSource<String> mySqlSource = MySqlSource.<String>builder()
                .hostname("192.168.56.10")
                .port(3306)
                .databaseList("testdb") // databases to monitor; several may be given, regex supported
                .tableList("testdb.access") // tables to monitor as db.table; several may be given, regex supported
                .username("root")
                .password("root")
                .deserializer(new JsonDebeziumDeserializationSchema()) // converts SourceRecord to JSON String
                .build();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Enable checkpointing (every 3000 ms)
        env.enableCheckpointing(3000);

        env
                .fromSource(mySqlSource, WatermarkStrategy.noWatermarks(), "MySQL Source")
                // run the source with a single parallel task
                .setParallelism(1)
                .print()
                .setParallelism(1); // use parallelism 1 for the sink to keep message ordering

        env.execute("Print MySQL Snapshot + Binlog");
    }
}
The output is JSON:
{
    "before": null,
    "after": {
        "id": 1,
        "name": "1"
    },
    "source": {
        "version": "1.9.7.Final",
        "connector": "mysql",
        "name": "mysql_binlog_source",
        "ts_ms": 1707353812000,
        "snapshot": "false",
        "db": "testdb", // database name
        "sequence": null,
        "table": "access", // table name
        "server_id": 1,
        "gtid": null,
        "file": "binlog.000005",
        "pos": 374,
        "row": 0,
        "thread": 9,
        "query": null
    },
    "op": "c", // operation: c = create, u = update, d = delete, r = read
    "ts_ms": 1707353812450,
    "transaction": null
}
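Downstream code usually branches on the `op` field. As a minimal sketch in plain Java (the class `CdcOp` and its method are hypothetical helpers, not part of Flink CDC), the four codes listed above could be mapped to readable names like this:

```java
import java.util.Map;

/** Hypothetical helper: maps the "op" codes from the sample output to readable names. */
public class CdcOp {
    // Codes as listed in the sample record: c = create, u = update, d = delete, r = read
    private static final Map<String, String> NAMES = Map.of(
            "c", "CREATE",
            "u", "UPDATE",
            "d", "DELETE",
            "r", "READ");

    /** Returns the readable name for an op code, or "UNKNOWN" for null or unlisted codes. */
    public static String describe(String op) {
        if (op == null) {
            return "UNKNOWN"; // Map.of rejects null lookups, so guard explicitly
        }
        return NAMES.getOrDefault(op, "UNKNOWN");
    }

    public static void main(String[] args) {
        System.out.println(describe("c")); // CREATE
        System.out.println(describe("x")); // UNKNOWN
    }
}
```

Records produced by the initial snapshot are expected to carry `r`, while later binlog changes carry `c`, `u`, or `d`.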
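2. More configuration options
https://ververica.github.io/flink-cdc-connectors/master/content/connectors/mysql-cdc%28ZH%29.html
The scan.startup.mode option specifies the startup mode of the MySQL CDC consumer. Valid values are:
initial (default): takes an initial snapshot of the monitored tables on first startup, then continues reading the latest binlog.
earliest-offset: skips the snapshot phase and reads from the earliest available binlog offset.
latest-offset: never takes a snapshot on first startup; the connector reads only from the end of the binlog, so it sees only changes made after it starts.
specific-offset: skips the snapshot phase and reads from a specified binlog offset. The offset is given either as a binlog file name and position or, when GTID is enabled on the cluster, as a GTID set.
timestamp: skips the snapshot phase and reads binlog events starting from a specified timestamp.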
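For the timestamp mode, the start point is expressed as epoch milliseconds. A minimal sketch of computing that value with `java.time` follows; the class name `StartupTimestamp` and the chosen date are illustrative assumptions. In the DataStream API the result would presumably be handed to the builder as `.startupOptions(StartupOptions.timestamp(millis))`, which is worth confirming against the documentation linked above for your connector version.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;

/** Hypothetical helper for deriving the start timestamp used by the timestamp startup mode. */
public class StartupTimestamp {
    /** Converts a wall-clock time in the given zone to epoch milliseconds. */
    public static long toEpochMillis(LocalDateTime localTime, String zoneId) {
        return localTime.atZone(ZoneId.of(zoneId)).toInstant().toEpochMilli();
    }

    public static void main(String[] args) {
        // Midnight on 2024-02-08 in Asia/Shanghai (an arbitrary example date)
        long millis = toEpochMillis(LocalDateTime.of(2024, 2, 8, 0, 0), "Asia/Shanghai");
        System.out.println(millis); // 1707321600000
    }
}
```

Stating the zone explicitly avoids depending on the JVM default time zone, which may differ between a development machine and the cluster.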
3. Custom deserializer
import com.ververica.cdc.debezium.DebeziumDeserializationSchema;
import io.debezium.data.Envelope;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.util.Collector;
import org.apache.kafka.connect.data.Field;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.source.SourceRecord;

import java.util.List;

public class DomainDeserializationSchema implements DebeziumDeserializationSchema<String> {
    @Override
    public void deserialize(SourceRecord sourceRecord, Collector<String> collector) throws Exception {
        // The topic is split on dots; index 1 is the database, index 2 the table
        String topic = sourceRecord.topic();
        String[] split = topic.split("\\.");
        System.out.println("database: " + split[1]);
        System.out.println("table: " + split[2]);

        Struct value = (Struct) sourceRecord.value();

        // Row state before the change (null for inserts)
        Struct before = value.getStruct("before");
        System.out.println("before: " + before);
        if (before != null) {
            // all fields
            List<Field> fields = before.schema().fields();
            for (Field field : fields) {
                System.out.println("before field: " + field.name() + " value: " + before.get(field));
            }
        }

        // Row state after the change (null for deletes)
        Struct after = value.getStruct("after");
        System.out.println("after: " + after);
        if (after != null) {
            // all fields
            List<Field> fields = after.schema().fields();
            for (Field field : fields) {
                System.out.println("after field: " + field.name() + " value: " + after.get(field));
            }
        }

        // Operation type
        Envelope.Operation operation = Envelope.operationFor(sourceRecord);
        System.out.println("operation: " + operation);

        // Emit the deserialized result (a placeholder string here)
        collector.collect("aaaaaaaaaaaaa");
    }

    @Override
    public TypeInformation<String> getProducedType() {
        return BasicTypeInfo.STRING_TYPE_INFO; // type of the emitted records
    }
}
Then pass the custom deserializer to the source builder:
MySqlSource<String> mySqlSource = MySqlSource.<String>builder()
        .hostname("192.168.56.10")
        .port(3306)
        .databaseList("testdb") // databases to monitor; several may be given
        .tableList("testdb.access") // tables to monitor; several may be given
        .username("root")
        .password("root")
        .deserializer(new DomainDeserializationSchema()) // custom deserializer
        .build();
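The deserializer above indexes `split[1]` and `split[2]` directly, which assumes the topic always has the form `<server name>.<database>.<table>`. Below is a slightly more defensive sketch of that parsing step in plain Java. The class `TopicParser` is hypothetical, and the sample topic is assembled from the `name`, `db`, and `table` values in the JSON output shown earlier, so treat it as an assumption.

```java
/** Hypothetical helper: extracts database and table from a "<server>.<database>.<table>" topic. */
public class TopicParser {
    /** Returns {database, table}; throws if the topic has fewer than three dot-separated parts. */
    public static String[] databaseAndTable(String topic) {
        // A limit of 3 keeps any further dots inside the last part instead of splitting on them
        String[] parts = topic.split("\\.", 3);
        if (parts.length < 3) {
            throw new IllegalArgumentException("Unexpected topic format: " + topic);
        }
        return new String[] {parts[1], parts[2]};
    }

    public static void main(String[] args) {
        String[] dbTable = databaseAndTable("mysql_binlog_source.testdb.access");
        System.out.println("database: " + dbTable[0] + ", table: " + dbTable[1]);
    }
}
```

Failing with a clear message is easier to diagnose inside a running job than an `ArrayIndexOutOfBoundsException` from an unexpected topic.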
4. Flink SQL approach
With CDC, the Flink SQL approach is used less often; the DataStream API is the more common choice.
III. Pitfalls
1. The MySQL server has a timezone offset (0 seconds ahead of UTC) which does not match the configured timezone Asia/Shanghai.
2024-02-08 08:36:33 INFO 5217 --- [lt-dispatcher-6] o.a.f.r.executiongraph.ExecutionGraph : Source: MySQL Source -> Sink: Print to Std. Out (1/1) (e2371dabd0c952a5dfa7c053cbde80c3_cbc357ccb763df2852fee8c4fc7d55f2_0_2) switched from CREATED to SCHEDULED.
2024-02-08 08:36:33 INFO 5217 --- [lt-dispatcher-8] o.a.f.r.r.s.FineGrainedSlotManager : Received resource requirements from job 369b1c979674a0444f679dd13264ea88: [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, numberOfRequiredSlots=1}]
2024-02-08 08:36:33 INFO 5218 --- [lt-dispatcher-6] o.a.flink.runtime.jobmaster.JobMaster : Trying to recover from a global failure.
org.apache.flink.util.FlinkException: Global failure triggered by OperatorCoordinator for 'Source: MySQL Source -> Sink: Print to Std. Out' (operator cbc357ccb763df2852fee8c4fc7d55f2).
    at org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolderQuiesceableContext.failJob(RecreateOnResetOperatorCoordinator.java:248)
    at org.apache.flink.runtime.source.coordinator.SourceCoordinatorContext.failJob(SourceCoordinatorContext.java:395)
    at org.apache.flink.runtime.source.coordinator.SourceCoordinator.start(SourceCoordinator.java:225)
    at org.apache.flink.runtime.operators.coordination.RecreateOnResetOperatorCoordinatorresetToCheckpointnullUniRun.tryFire(CompletableFuture.java:701)
    at java.util.concurrent.CompletableFuturehandleRunAsync(PartialFunction.scala:126)
    at org.apache.pekko.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:29)
    at scala.PartialFunctionOrElse.applyOrElse(PartialFunction.scala:176)
    at scala.PartialFunction(Actor.scala:545)
    at org.apache.pekko.actor.AbstractActor.aroundReceive(AbstractActor.scala:229)
    at org.apache.pekko.actor.ActorCell.receiveMessage(ActorCell.scala:590)
    at org.apache.pekko.actor.ActorCell.invoke(ActorCell.scala:557)
    at org.apache.pekko.dispatch.Mailbox.processMailbox(Mailbox.scala:280)
    at org.apache.pekko.dispatch.Mailbox.run(Mailbox.scala:241)
    at org.apache.pekko.dispatch.Mailbox.exec(Mailbox.scala:253)
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
    at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1067)
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1703)
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:172)
Caused by: org.apache.flink.table.api.ValidationException: The MySQL server has a timezone offset (0 seconds ahead of UTC) which does not match the configured timezone Asia/Shanghai. Specify the right server-time-zone to avoid inconsistencies for time-related fields.
    at com.ververica.cdc.connectors.mysql.MySqlValidator.checkTimeZone(MySqlValidator.java:184)
    at com.ververica.cdc.connectors.mysql.MySqlValidator.validate(MySqlValidator.java:72)
    at com.ververica.cdc.connectors.mysql.source.MySqlSource.createEnumerator(MySqlSource.java:197)
    at org.apache.flink.runtime.source.coordinator.SourceCoordinator.start(SourceCoordinator.java:221)
    ... 42 common frames omitted
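Check the MySQL time zone settings:
show variables like '%time_zone%';
Solution: set the MySQL time zone to match the configured one. If MySQL rejects the named zone because its time zone tables are not loaded, an offset such as '+08:00' can be used instead.
SET time_zone = 'Asia/Shanghai';
SET @@global.time_zone = 'Asia/Shanghai';
# Check again
SELECT @@global.time_zone;
show variables like '%time_zone%';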
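The error comes down to an offset comparison: the server reports 0 seconds ahead of UTC, while the configured zone is Asia/Shanghai. Here is a minimal sketch of that comparison using `java.time`. The class `TimeZoneCheck` is hypothetical and only mirrors the idea behind the message; it is not the connector's actual validation code.

```java
import java.time.Instant;
import java.time.ZoneId;

/** Hypothetical illustration of the offset mismatch reported in the error above. */
public class TimeZoneCheck {
    /** Offset of the given zone from UTC, in seconds, at the given instant. */
    public static int offsetSeconds(String zoneId, Instant at) {
        return ZoneId.of(zoneId).getRules().getOffset(at).getTotalSeconds();
    }

    /** True when the server's reported offset agrees with the configured zone. */
    public static boolean matches(int serverOffsetSeconds, String configuredZone, Instant at) {
        return serverOffsetSeconds == offsetSeconds(configuredZone, at);
    }

    public static void main(String[] args) {
        Instant at = Instant.ofEpochMilli(1707353812000L); // ts_ms from the sample record
        System.out.println(offsetSeconds("Asia/Shanghai", at)); // 28800
        System.out.println(matches(0, "Asia/Shanghai", at));    // false
        System.out.println(matches(0, "UTC", at));              // true
    }
}
```

Either side can be changed to make the two agree. The SQL above moves the server to Asia/Shanghai. Alternatively, as the error message suggests, the connector's server-time-zone setting can be pointed at the zone the server actually uses (for the DataStream API this is presumably `.serverTimeZone("UTC")` on the source builder).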
References
Source code: https://github.com/ververica/flink-cdc-connectors
Documentation: https://ververica.github.io/flink-cdc-connectors/master/content/overview/cdc-connectors.html
Website: https://ververica.github.io/flink-cdc-connectors/