A First Look at Flink CDC

Flink CDC

Some reference documents

Digging into the source code

  1. An example of using the connector from a Flink job:
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import com.alibaba.ververica.cdc.debezium.StringDebeziumDeserializationSchema;
import com.alibaba.ververica.cdc.connectors.mysql.MySQLSource;

public class MySqlBinlogSourceExample {
  public static void main(String[] args) throws Exception {
    SourceFunction<String> sourceFunction = MySQLSource.<String>builder()
      .hostname("localhost")
      .port(3306)
      .databaseList("inventory") // monitor all tables under inventory database
      .username("flinkuser")
      .password("flinkpw")
      .deserializer(new StringDebeziumDeserializationSchema()) // converts SourceRecord to String
      .build();

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    env
      .addSource(sourceFunction)
      .print().setParallelism(1); // use parallelism 1 for sink to keep message ordering

    env.execute();
  }
}
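To make the builder chain above less opaque, here is a minimal sketch of how such a builder could turn its arguments into configuration for a Debezium MySQL connector. The class `SourceBuilderSketch` and its `Builder` are hypothetical stand-ins, not the real `MySQLSource`. The property keys follow the Debezium 1.2 MySQL connector naming as I understand it, and are an assumption about what the real builder sets.

```java
import java.util.Properties;

class SourceBuilderSketch {

    /** Hypothetical stand-in for the connector's builder; not the real MySQLSource class. */
    static final class Builder {
        private String hostname;
        private int port = 3306; // MySQL default port
        private String[] databaseList;
        private String username;
        private String password;

        Builder hostname(String hostname) { this.hostname = hostname; return this; }
        Builder port(int port) { this.port = port; return this; }
        Builder databaseList(String... databaseList) { this.databaseList = databaseList; return this; }
        Builder username(String username) { this.username = username; return this; }
        Builder password(String password) { this.password = password; return this; }

        /** Validates the required fields and maps them to assumed Debezium property keys. */
        Properties build() {
            if (hostname == null || username == null || password == null) {
                throw new IllegalStateException("hostname, username and password are required");
            }
            Properties props = new Properties();
            props.setProperty("connector.class", "io.debezium.connector.mysql.MySqlConnector");
            props.setProperty("database.hostname", hostname);
            props.setProperty("database.port", String.valueOf(port));
            props.setProperty("database.user", username);
            props.setProperty("database.password", password);
            if (databaseList != null) {
                // Assumed key: Debezium 1.2 names the database filter "database.whitelist".
                props.setProperty("database.whitelist", String.join(",", databaseList));
            }
            return props;
        }
    }

    /** Builds the same settings as the example job above. */
    static Properties demoProperties() {
        return new Builder()
                .hostname("localhost")
                .port(3306)
                .databaseList("inventory")
                .username("flinkuser")
                .password("flinkpw")
                .build();
    }

    public static void main(String[] args) {
        Properties props = demoProperties();
        props.stringPropertyNames().stream()
                .sorted()
                .forEach(key -> System.out.println(key + " = " + props.getProperty(key)));
    }
}
```

In the real connector the resulting properties would be handed to the source function together with the deserializer; the sketch stops at the properties because that is the part the builder itself is responsible for.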
  2. UML class diagram of the MySQL reading path in the flink-cdc-connectors project:

flink_cdc_MySQLSource.png

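  • The key piece in the step above is the MySQL source class, com.alibaba.ververica.cdc.connectors.mysql.MySQLSource. It is a builder that gathers the connection parameters and the startup mode, and finally creates the class that reads through Debezium: com.alibaba.ververica.cdc.debezium.DebeziumSourceFunction. That class is a Flink SourceFunction, and it performs both the snapshot read and the incremental binlog read. Its class-level Javadoc reads: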
/**
 * The {@link DebeziumSourceFunction} is a streaming data source that pulls captured change data
 * from databases into Flink.
 *
 * <p>There are two workers during the runtime. One worker periodically pulls records from the
 * database and pushes the records into the {@link Handover}. The other worker consumes the records
 * from the {@link Handover} and convert the records to the data in Flink style. The reason why
 * don't use one workers is because debezium has different behaviours in snapshot phase and
 * streaming phase.
 *
 * <p>Here we use the {@link Handover} as the buffer to submit data from the producer to the
 * consumer. Because the two threads don't communicate to each other directly, the error reporting
 * also relies on {@link Handover}. When the engine gets errors, the engine uses the {@link
 * DebeziumEngine.CompletionCallback} to report errors to the {@link Handover} and wakes up the
 * consumer to check the error. However, the source function just closes the engine and wakes up the
 * producer if the error is from the Flink side.
 *
 * <p>If the execution is canceled or finish(only snapshot phase), the exit logic is as same as the
 * logic in the error reporting.
 *
 * <p>The source function participates in checkpointing and guarantees that no data is lost during a
 * failure, and that the computation processes elements "exactly once".
 *
 * <p>Note: currently, the source function can't run in multiple parallel instances.
 *
 * <p>Please refer to Debezium's documentation for the available configuration properties:
 * https://debezium.io/documentation/reference/1.2/development/engine.html#engine-properties
 */
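The two-worker design described in the Javadoc can be pictured with a small, self-contained sketch. This is not the connector's actual `Handover` class: the `Handover` type below, its method names, and the single-slot design are assumptions chosen for illustration. It shows a producer thread handing batches to a consumer, and an error being reported through the same object so that the consumer notices it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class HandoverSketch {

    /** Hypothetical single-slot buffer shared by one producer and one consumer. */
    static final class Handover<T> {
        private final Object lock = new Object();
        private List<T> next;      // batch waiting for the consumer, or null
        private Throwable error;   // failure reported by the producer side
        private boolean closed;

        /** Blocks until the previous batch has been taken, then stores the new one. */
        void produce(List<T> batch) throws InterruptedException {
            synchronized (lock) {
                while (next != null && !closed) {
                    lock.wait();
                }
                if (closed) {
                    throw new IllegalStateException("handover closed");
                }
                next = batch;
                lock.notifyAll();
            }
        }

        /** Returns the next batch, rethrows a reported error, or returns null once closed. */
        List<T> pollNext() throws InterruptedException {
            synchronized (lock) {
                while (next == null && error == null && !closed) {
                    lock.wait();
                }
                if (next != null) {
                    List<T> batch = next;
                    next = null;
                    lock.notifyAll(); // wake a producer waiting for the free slot
                    return batch;
                }
                if (error != null) {
                    throw new IllegalStateException("producer failed", error);
                }
                return null;
            }
        }

        /** Records the first error and wakes the consumer so it can check it. */
        void reportError(Throwable t) {
            synchronized (lock) {
                if (error == null) {
                    error = t;
                }
                lock.notifyAll();
            }
        }

        /** Wakes both sides so that neither stays blocked after shutdown. */
        void close() {
            synchronized (lock) {
                closed = true;
                lock.notifyAll();
            }
        }
    }

    static List<String> runDemo() throws InterruptedException {
        final Handover<String> handover = new Handover<>();
        // Producer: stands in for the worker that pulls records from the database.
        Thread producer = new Thread(() -> {
            try {
                handover.produce(Arrays.asList("snapshot-row-1", "snapshot-row-2"));
                handover.produce(Arrays.asList("binlog-event-1"));
                handover.reportError(new RuntimeException("engine stopped"));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        // Consumer: stands in for the worker that emits records into Flink.
        List<String> seen = new ArrayList<>();
        try {
            List<String> batch;
            while ((batch = handover.pollNext()) != null) {
                seen.addAll(batch);
            }
        } catch (IllegalStateException e) {
            seen.add("error: " + e.getCause().getMessage());
        } finally {
            handover.close();
        }
        producer.join();
        return seen;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runDemo());
    }
}
```

Because pending batches are drained before the error is surfaced, the consumer in this sketch sees all three records and then the failure. Whether the real class orders these two checks the same way is not something the Javadoc states.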
  3. Summary diagram of the CDC call flow:

flink_mysql_cdc.png

Preliminary conclusions

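  • Because data synchronization goes through the embedded Debezium engine (see the DebeziumEngine reference in the Javadoc above), only a parallelism of 1 is supported for now. According to the project's issues, multi-parallelism support is under development; this should be re-checked against a newer release.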
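  • No separate Kafka or Debezium deployment is needed, so the overall architecture is simpler.
  • If the existing architecture already includes Kafka, and you want an intermediate buffer for decoupling or need multiple topics and partitions to raise parallelism, Kafka still has to stay for now.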