Flink CDC: A First Look. Notes and analysis on Flink CDC. Reference: https://blog.csdn.net/

Flink CDC

Some reference documents

A closer look at the source code

An example of calling the connector from a Flink job:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import com.alibaba.ververica.cdc.debezium.StringDebeziumDeserializationSchema;
import com.alibaba.ververica.cdc.connectors.mysql.MySQLSource;

public class MySqlBinlogSourceExample {
  public static void main(String[] args) throws Exception {
    SourceFunction<String> sourceFunction = MySQLSource.<String>builder()
      .hostname("localhost")
      .port(3306)
      .databaseList("inventory") // monitor all tables under inventory database
      .username("flinkuser")
      .password("flinkpw")
      .deserializer(new StringDebeziumDeserializationSchema()) // converts SourceRecord to String
      .build();

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    env
      .addSource(sourceFunction)
      .print().setParallelism(1); // use parallelism 1 for sink to keep message ordering

    env.execute();
  }
}
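To make the builder calls above more concrete, here is a rough sketch of how a builder of this kind might collect its arguments into Debezium connector properties. The class `MySqlSourcePropsSketch` is hypothetical and is not the connector's real `MySQLSource` builder. The property keys are Debezium MySQL connector settings as named in the 1.2 line (`database.whitelist` was later renamed), and the exact set of keys the real builder writes is an assumption here.

```java
import java.util.Objects;
import java.util.Properties;

/** Hypothetical illustration only; not the real MySQLSource builder. */
public class MySqlSourcePropsSketch {
    private String hostname;
    private int port = 3306; // MySQL default port
    private String[] databaseList;
    private String username;
    private String password;

    public MySqlSourcePropsSketch hostname(String hostname) { this.hostname = hostname; return this; }
    public MySqlSourcePropsSketch port(int port) { this.port = port; return this; }
    public MySqlSourcePropsSketch databaseList(String... dbs) { this.databaseList = dbs; return this; }
    public MySqlSourcePropsSketch username(String username) { this.username = username; return this; }
    public MySqlSourcePropsSketch password(String password) { this.password = password; return this; }

    /** Translates the collected arguments into Debezium-style properties. */
    public Properties build() {
        Properties props = new Properties();
        // Tells the Debezium engine which connector implementation to run.
        props.setProperty("connector.class", "io.debezium.connector.mysql.MySqlConnector");
        props.setProperty("database.hostname", Objects.requireNonNull(hostname, "hostname"));
        props.setProperty("database.port", String.valueOf(port));
        props.setProperty("database.user", Objects.requireNonNull(username, "username"));
        props.setProperty("database.password", Objects.requireNonNull(password, "password"));
        if (databaseList != null) {
            // Restricts capture to the listed databases (key name as of Debezium 1.2).
            props.setProperty("database.whitelist", String.join(",", databaseList));
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = new MySqlSourcePropsSketch()
            .hostname("localhost")
            .port(3306)
            .databaseList("inventory")
            .username("flinkuser")
            .password("flinkpw")
            .build();
        props.list(System.out);
    }
}
```

In the real connector the resulting properties would be handed to the Debezium-based source function rather than printed.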

The UML class diagram for the MySQL reading code in the flink-cdc-connectors project is summarized as follows:

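The key piece in the step above is the MySQL source class, com.alibaba.ververica.cdc.connectors.mysql.MySQLSource. It is a builder that collects the connection parameters and the startup mode, and finally creates the class that reads through Debezium: com.alibaba.ververica.cdc.debezium.DebeziumSourceFunction. This is a Flink SourceFunction, and it performs both the snapshot read and the incremental binlog read. The class is documented as follows: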

/**
 * The {@link DebeziumSourceFunction} is a streaming data source that pulls captured change data
 * from databases into Flink.
 *
 * <p>There are two workers during the runtime. One worker periodically pulls records from the
 * database and pushes the records into the {@link Handover}. The other worker consumes the records
 * from the {@link Handover} and convert the records to the data in Flink style. The reason why
 * don't use one workers is because debezium has different behaviours in snapshot phase and
 * streaming phase.
 *
 * <p>Here we use the {@link Handover} as the buffer to submit data from the producer to the
 * consumer. Because the two threads don't communicate to each other directly, the error reporting
 * also relies on {@link Handover}. When the engine gets errors, the engine uses the {@link
 * DebeziumEngine.CompletionCallback} to report errors to the {@link Handover} and wakes up the
 * consumer to check the error. However, the source function just closes the engine and wakes up the
 * producer if the error is from the Flink side.
 *
 * <p>If the execution is canceled or finish(only snapshot phase), the exit logic is as same as the
 * logic in the error reporting.
 *
 * <p>The source function participates in checkpointing and guarantees that no data is lost during a
 * failure, and that the computation processes elements "exactly once".
 *
 * <p>Note: currently, the source function can't run in multiple parallel instances.
 *
 * <p>Please refer to Debezium's documentation for the available configuration properties:
 * https://debezium.io/documentation/reference/1.2/development/engine.html#engine-properties
 */
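The two-worker arrangement described in the Javadoc can be sketched with a small single-slot buffer. This is a simplified, hypothetical stand-in written for illustration: `HandoverSketch` and its method names are assumptions and do not reproduce the connector's real `Handover` class. It shows the two ideas from the comment: the producer hands batches to the consumer through a shared buffer, and errors travel through that same buffer so the consumer is woken up to see them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical single-slot buffer between a producer thread and a consumer thread. */
public class HandoverSketch {
    private final Object lock = new Object();
    private List<String> batch; // at most one pending batch
    private Throwable error;    // error reported by the producer side

    /** Producer side: blocks until the previous batch has been taken. */
    public void produce(List<String> records) throws InterruptedException {
        synchronized (lock) {
            while (batch != null && error == null) {
                lock.wait();
            }
            if (error != null) {
                throw new IllegalStateException("handover already failed", error);
            }
            batch = records;
            lock.notifyAll(); // wake the consumer
        }
    }

    /** Consumer side: blocks until a batch or an error is available. */
    public List<String> pollNext() throws Exception {
        synchronized (lock) {
            while (batch == null && error == null) {
                lock.wait();
            }
            if (batch != null) {
                // Deliver pending data before surfacing any error.
                List<String> result = batch;
                batch = null;
                lock.notifyAll(); // wake the producer
                return result;
            }
            throw new Exception("producer failed", error);
        }
    }

    /** Error path: store the error and wake whoever is waiting. */
    public void reportError(Throwable t) {
        synchronized (lock) {
            if (error == null) {
                error = t;
            }
            lock.notifyAll();
        }
    }

    /** Runs one producer thread against the calling thread as consumer. */
    public static List<String> runDemo() throws Exception {
        HandoverSketch handover = new HandoverSketch();
        Thread producer = new Thread(() -> {
            try {
                handover.produce(Arrays.asList("snapshot-1", "snapshot-2"));
                handover.produce(Arrays.asList("binlog-1"));
                handover.reportError(new IllegalStateException("engine stopped"));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        List<String> seen = new ArrayList<>();
        try {
            while (true) {
                seen.addAll(handover.pollNext());
            }
        } catch (Exception e) {
            seen.add("error: " + e.getCause().getMessage());
        }
        producer.join();
        return seen;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runDemo());
    }
}
```

Because the buffer holds only one batch, the producer cannot run ahead of the consumer, which is one plausible way to get back-pressure between the two workers; whether the real implementation makes the same choice is not confirmed here.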

Summary of the detailed CDC call flow

Preliminary conclusions

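Because data synchronization currently runs through an embedded Debezium engine, only a parallelism of 1 is supported. According to the project's issues, multi-parallelism support is under development and remains to be confirmed in a new release.
There is no need to deploy Kafka or Debezium separately, so the overall architecture is fairly simple.
If the existing architecture already includes a Kafka deployment, and you want an intermediate buffer for decoupling or need multiple topics and partitions to increase parallelism, Kafka still has to be kept for now.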