Flink from Beginner to Practice (3): Real-Time Data Ingestion - Flink MySQL CDC


Series index

- Flink from Beginner to Practice (1): Flink Basics, Flink Deployment
- Flink from Beginner to Practice (2): Flink DataStream API
- Flink from Beginner to Practice (3): Real-Time Data Ingestion - Flink MySQL CDC

I. Overview

1. Version compatibility

Pay attention to your MySQL version; this walkthrough uses MySQL 8.0. Flink CDC supports many databases besides MySQL. The Flink and Flink CDC versions must also match: here we use Flink 1.18 together with Flink CDC 3.0.

2. Dependencies

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-clients</artifactId>
    <version>1.18.0</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java</artifactId>
    <version>1.18.0</version>
</dependency>

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-base</artifactId>
    <version>1.18.0</version>
</dependency>

<dependency>
    <groupId>com.ververica</groupId>
    <artifactId>flink-connector-mysql-cdc</artifactId>
    <version>3.0.0</version>
</dependency>

<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>8.0.27</version>
</dependency>

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-runtime</artifactId>
    <version>1.18.0</version>
</dependency>


II. Implementation

1. Basic usage

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;
import com.ververica.cdc.connectors.mysql.source.MySqlSource;

/**
 * Flink MySQL CDC.
 * On each startup (default "initial" mode), takes a snapshot of all existing
 * data first, then continues reading changes from the binlog.
 */
public class FlinkCDC01 {
    public static void main(String[] args) throws Exception {
        MySqlSource<String> mySqlSource = MySqlSource.<String>builder()
                .hostname("192.168.56.10")
                .port(3306)
                .databaseList("testdb") // databases to monitor; multiple entries and regex are supported
                .tableList("testdb.access") // tables to monitor as db.table; multiple entries and regex are supported
                .username("root")
                .password("root")
                .deserializer(new JsonDebeziumDeserializationSchema()) // converts SourceRecord to JSON String
                .build();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // enable checkpointing
        env.enableCheckpointing(3000);

        env
            .fromSource(mySqlSource, WatermarkStrategy.noWatermarks(), "MySQL Source")
            // one parallel task for the source
            .setParallelism(1)
            .print()
            .setParallelism(1); // parallelism 1 for the sink to preserve message order

        env.execute("Print MySQL Snapshot + Binlog");
    }
}

The result is JSON data:

{
  "before": null,
  "after": { "id": 1, "name": "1" },
  "source": {
    "version": "1.9.7.Final",
    "connector": "mysql",
    "name": "mysql_binlog_source",
    "ts_ms": 1707353812000,
    "snapshot": "false",
    "db": "testdb",          // database name
    "sequence": null,
    "table": "access",       // table name
    "server_id": 1,
    "gtid": null,
    "file": "binlog.000005",
    "pos": 374,
    "row": 0,
    "thread": 9,
    "query": null
  },
  "op": "c",                 // operation: c = create, u = update, d = delete, r = read
  "ts_ms": 1707353812450,
  "transaction": null
}
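When post-processing these records it is handy to turn the `op` code into a readable change type. A minimal helper sketch (plain Java, independent of Flink; the class and method names are illustrative, not part of the connector API):

```java
public class CdcOp {
    /** Maps a Debezium "op" code to a human-readable change type. */
    public static String describe(String op) {
        switch (op) {
            case "c": return "create";
            case "u": return "update";
            case "d": return "delete";
            case "r": return "read (snapshot)";
            default:  return "unknown";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("c")); // create
    }
}
```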

2. More configuration

https://ververica.github.io/flink-cdc-connectors/master/content/connectors/mysql-cdc%28ZH%29.html

The configuration option `scan.startup.mode` specifies the startup mode of the MySQL CDC consumer. Valid values are:

- `initial` (default): take an initial snapshot of the monitored tables on first startup, then continue reading from the latest binlog position.
- `earliest-offset`: skip the snapshot phase and read from the earliest available binlog offset.
- `latest-offset`: never take a snapshot on first startup; read only from the end of the binlog, i.e. only changes made after the connector starts.
- `specific-offset`: skip the snapshot phase and start from a specified binlog offset, given either as a binlog file name and position, or as a GTID set when GTIDs are enabled on the cluster.
- `timestamp`: skip the snapshot phase and read binlog events starting from a specified timestamp.
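In the DataStream API, the same startup modes are exposed through `StartupOptions` on the source builder. A minimal sketch, assuming the same placeholder host, credentials, and table names as the example above (this is a configuration fragment, not a runnable job):

```java
import com.ververica.cdc.connectors.mysql.source.MySqlSource;
import com.ververica.cdc.connectors.mysql.table.StartupOptions;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;

public class StartupModeExample {
    public static void main(String[] args) {
        MySqlSource<String> source = MySqlSource.<String>builder()
                .hostname("192.168.56.10")
                .port(3306)
                .databaseList("testdb")
                .tableList("testdb.access")
                .username("root")
                .password("root")
                // latest-offset: read only changes made after the connector starts
                .startupOptions(StartupOptions.latest())
                // other modes: StartupOptions.initial(), StartupOptions.earliest(),
                // StartupOptions.timestamp(1707353812000L)
                .deserializer(new JsonDebeziumDeserializationSchema())
                .build();
    }
}
```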

3. Custom deserializer

import com.ververica.cdc.debezium.DebeziumDeserializationSchema;
import io.debezium.data.Envelope;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.util.Collector;
import org.apache.kafka.connect.data.Field;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.source.SourceRecord;

import java.util.List;

public class DomainDeserializationSchema implements DebeziumDeserializationSchema<String> {

    @Override
    public void deserialize(SourceRecord sourceRecord, Collector<String> collector) throws Exception {
        // the topic has the form "<source name>.<database>.<table>"
        String topic = sourceRecord.topic();
        String[] split = topic.split("\\.");
        System.out.println("database: " + split[1]);
        System.out.println("table: " + split[2]);

        Struct value = (Struct) sourceRecord.value();
        // row image before the change (null for inserts)
        Struct before = value.getStruct("before");
        System.out.println("before: " + before);
        if (before != null) {
            // all fields
            List<Field> fields = before.schema().fields();
            for (Field field : fields) {
                System.out.println("before field: " + field.name() + " value: " + before.get(field));
            }
        }
        // row image after the change (null for deletes)
        Struct after = value.getStruct("after");
        System.out.println("after: " + after);
        if (after != null) {
            // all fields
            List<Field> fields = after.schema().fields();
            for (Field field : fields) {
                System.out.println("after field: " + field.name() + " value: " + after.get(field));
            }
        }
        // operation type (CREATE / UPDATE / DELETE / READ)
        Envelope.Operation operation = Envelope.operationFor(sourceRecord);
        System.out.println("operation: " + operation);

        // emit the deserialized result (placeholder string here)
        collector.collect("aaaaaaaaaaaaa");
    }

    @Override
    public TypeInformation<String> getProducedType() {
        return BasicTypeInfo.STRING_TYPE_INFO; // produced type
    }
}

MySqlSource<String> mySqlSource = MySqlSource.<String>builder()
        .hostname("192.168.56.10")
        .port(3306)
        .databaseList("testdb") // databases to monitor; multiple entries are supported
        .tableList("testdb.access") // tables to monitor; multiple entries are supported
        .username("root")
        .password("root")
        .deserializer(new DomainDeserializationSchema()) // custom deserializer
        .build();

4. Flink SQL approach

In practice the SQL flavor of CDC is used less often; the DataStream API is more common.
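For reference, the SQL approach registers a table backed by the `mysql-cdc` connector through a DDL statement. A sketch using the Table API, assuming the same placeholder connection values as above and `flink-table-api-java` on the classpath (not runnable without a reachable MySQL instance):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class FlinkSqlCdcExample {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // register a CDC source table backed by the mysql-cdc connector
        tEnv.executeSql(
                "CREATE TABLE access_cdc (" +
                "  id INT," +
                "  name STRING," +
                "  PRIMARY KEY (id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'mysql-cdc'," +
                "  'hostname' = '192.168.56.10'," +
                "  'port' = '3306'," +
                "  'username' = 'root'," +
                "  'password' = 'root'," +
                "  'database-name' = 'testdb'," +
                "  'table-name' = 'access'" +
                ")");

        // changelog stream: snapshot rows first, then binlog changes
        tEnv.executeSql("SELECT * FROM access_cdc").print();
    }
}
```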

III. Pitfalls

1. The MySQL server has a timezone offset (0 seconds ahead of UTC) which does not match the configured timezone Asia/Shanghai.

2024-02-08 08:36:33 INFO [lt-dispatcher-6] o.a.flink.runtime.jobmaster.JobMaster : Trying to recover from a global failure.
org.apache.flink.util.FlinkException: Global failure triggered by OperatorCoordinator for 'Source: MySQL Source -> Sink: Print to Std. Out' (operator cbc357ccb763df2852fee8c4fc7d55f2).
	at org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder$LazyInitializedCoordinatorContext.failJob(OperatorCoordinatorHolder.java:624)
	at org.apache.flink.runtime.operators.coordination.RecreateOnResetOperatorCoordinator$QuiesceableContext.failJob(RecreateOnResetOperatorCoordinator.java:248)
	at org.apache.flink.runtime.source.coordinator.SourceCoordinatorContext.failJob(SourceCoordinatorContext.java:395)
	at org.apache.flink.runtime.source.coordinator.SourceCoordinator.start(SourceCoordinator.java:225)
	...
Caused by: org.apache.flink.table.api.ValidationException: The MySQL server has a timezone offset (0 seconds ahead of UTC) which does not match the configured timezone Asia/Shanghai. Specify the right server-time-zone to avoid inconsistencies for time-related fields.
	at com.ververica.cdc.connectors.mysql.MySqlValidator.checkTimeZone(MySqlValidator.java:184)
	at com.ververica.cdc.connectors.mysql.MySqlValidator.validate(MySqlValidator.java:72)
	at com.ververica.cdc.connectors.mysql.source.MySqlSource.createEnumerator(MySqlSource.java:197)
	at org.apache.flink.runtime.source.coordinator.SourceCoordinator.start(SourceCoordinator.java:221)
	... 42 common frames omitted

Check the current MySQL setting with `show variables like '%time_zone%';`. Fix:

SET time_zone = 'Asia/Shanghai';
SET @@global.time_zone = 'Asia/Shanghai';
# if named time zones are not loaded on the server, use a numeric offset instead:
# SET @@global.time_zone = '+8:00';
# check again
SELECT @@global.time_zone;
show variables like '%time_zone%';
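Alternatively, instead of (or in addition to) changing the server setting, the connector can be told the server's time zone explicitly via the builder. A sketch assuming the same placeholder connection parameters as the earlier examples (configuration fragment, not a runnable job):

```java
import com.ververica.cdc.connectors.mysql.source.MySqlSource;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;

public class ServerTimeZoneExample {
    public static void main(String[] args) {
        MySqlSource<String> mySqlSource = MySqlSource.<String>builder()
                .hostname("192.168.56.10")
                .port(3306)
                .databaseList("testdb")
                .tableList("testdb.access")
                .username("root")
                .password("root")
                // declare the session time zone used for converting time-related
                // fields, so it matches the MySQL server's time_zone setting
                .serverTimeZone("Asia/Shanghai")
                .deserializer(new JsonDebeziumDeserializationSchema())
                .build();
    }
}
```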


References

- Source: https://github.com/ververica/flink-cdc-connectors
- Docs: https://ververica.github.io/flink-cdc-connectors/master/content/overview/cdc-connectors.html
- Site: https://ververica.github.io/flink-cdc-connectors/