Building a Data Synchronization Pipeline with Kafka Connect Plugins

1. Introduction

Kafka's Connect framework is powerful and widely used in real-time data warehousing. Below is a small example of quickly building a real-time data migration and synchronization pipeline on top of Connect plugins, with zero code required.

The goal of the example is to replicate a MariaDB database to a PostgreSQL database in real time.

2. Source Side

Download the debezium-connector-mysql plugin (MariaDB speaks the MySQL binlog protocol, so Debezium's MySQL connector is used here) from: Debezium plugin 1.9
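
Debezium captures changes from the MariaDB binlog, which must be enabled in row format before the connector can work. A minimal sketch of the relevant server settings (the server_id value and log file name are assumptions; adjust to your installation):

[mysqld]
server_id        = 1          # any ID unique within the replication topology (assumption)
log_bin          = mysql-bin  # enable the binary log
binlog_format    = ROW        # Debezium needs row-level change events
binlog_row_image = FULL       # record the full row in each event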

Then register the connector by sending the following POST request:

{
  "name": "mariadb-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "tasks.max": "1",
    "database.hostname": "mariadb IP",
    "database.port": "3306",
    "database.user": "mariadb 用户",
    "database.password": "mariadb 密码",
    "database.server.id": "随机数字,比如184054",
    "database.server.name": "acm",
    "database.include.list": "acm",
    "database.history.kafka.bootstrap.servers": "127.0.0.1:9092",
    "database.history.kafka.topic": "schema-changes.acm",
    "time.precision.mode": "connect"
  }
}
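
For reference, here is a minimal sketch of submitting this configuration through the Connect REST API, assuming the worker listens on its default port 8083 and the JSON above is saved as mariadb-connector.json:

import json

import requests  # assumption: the requests package is installed

CONNECT_URL = "http://127.0.0.1:8083/connectors"  # default Connect REST port

with open("mariadb-connector.json") as f:
    config = json.load(f)

# POST /connectors registers the connector; HTTP 409 means the name already exists
resp = requests.post(CONNECT_URL, json=config)
resp.raise_for_status()
print(resp.json())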

3. Target Side

Debezium only provides source-side connectors. For the sink side there is no getting around Confluent, the commercial company founded by Kafka's original creators, which offers many official and third-party connectors for download at: the Confluent plugin hub

The Confluent hub actually hosts the Debezium connectors as well; all of them can be downloaded there.

Here, choose the JDBC sink plugin (kafka-connect-jdbc) for download.

After downloading, deployment follows the usual steps; see my previous article for details: link
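
In short, the plugin's jars go into a directory listed in the worker's plugin.path, and the worker is restarted to pick them up. A sketch of the relevant worker properties (the paths are assumptions):

# connect-distributed.properties (excerpt)
bootstrap.servers=127.0.0.1:9092
plugin.path=/opt/kafka/plugins  # unpack kafka-connect-jdbc under this directory (assumed path)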

Then send one of the following POST requests, registered the same way as the source connector above; three variants are shown:

Example 1: a basic sink that uses the full topic name (e.g. acm.acm.users) as the target table name:
{
  "name": "test-xxx-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "tasks.max": "1",
    "topics.regex": "acm.acm.(.*)",
    "connection.url": "jdbc:postgresql://IP:port/acm?user=xxx&password=xxx",
    "auto.create": true,
    "auto.evolve": true,
    "insert.mode": "insert",
    "transforms": "ExtractField,ReplaceField",
    "transforms.ExtractField.type": "org.apache.kafka.connect.transforms.ExtractField$Value",
    "transforms.ExtractField.field": "after",
    "dialect.name": "PostgreSqlDatabaseDialect",
    "transforms.ReplaceField.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
    "transforms.ReplaceField.blacklist": "operator,last_modified_by",
    "db.timezone": "UTC"
  }
}
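
To see what the two transforms are for: Debezium wraps every row change in an envelope, while the JDBC sink wants just the row. Below is a simplified sketch of an update event's value (real events also carry a schema block and richer source metadata; the table and column names are illustrative):

{
  "before": { "id": 1, "name": "old", "operator": "u1", "last_modified_by": "u1" },
  "after":  { "id": 1, "name": "new", "operator": "u2", "last_modified_by": "u2" },
  "source": { "db": "acm", "table": "demo" },
  "op": "u",
  "ts_ms": 1650000000000
}

ExtractField$Value keeps only the after sub-structure, and ReplaceField then drops the operator and last_modified_by columns before the row is written to PostgreSQL.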

Example 2: adds a RegexRouter transform that strips the leading server prefix, so topic acm.acm.users maps to table acm.users instead:
{
  "name": "test-xxx-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "tasks.max": "1",
    "topics.regex": "acm.acm.(.*)",
    "connection.url": "jdbc:postgresql://IP:port/acm?user=xxx&password=xxx",
    "auto.create": true,
    "auto.evolve": true,
    "insert.mode": "insert",
    "transforms": "dropPrefix,ExtractField,ReplaceField",
    "transforms.dropPrefix.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.dropPrefix.regex": "acm.(acm.*)",
    "transforms.dropPrefix.replacement": "$1",
    "transforms.ExtractField.type": "org.apache.kafka.connect.transforms.ExtractField$Value",
    "transforms.ExtractField.field": "after",
    "dialect.name": "PostgreSqlDatabaseDialect",
    "transforms.ReplaceField.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
    "transforms.ReplaceField.blacklist": "operator,last_modified_by",
    "db.timezone": "UTC"
  }
}
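
The dropPrefix rule is plain regex substitution; a quick sketch of the same rewrite in Python (RegexRouter itself uses Java regex and requires the pattern to match the whole topic name, but the behavior here is the same):

import re

# same pattern and replacement as transforms.dropPrefix above
print(re.sub(r"acm.(acm.*)", r"\1", "acm.acm.users"))  # -> acm.users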

Example 3: the same as Example 2, minus the explicit db.timezone setting:
{
  "name": "test-xxx-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "tasks.max": "1",
    "topics.regex": "acm.acm.(.*)",
    "dialect.name": "PostgreSqlDatabaseDialect",
    "connection.url": "jdbc:postgresql://IP:port/acm?user=xxx&password=xxx",
    "auto.create": true,
    "auto.evolve": true,
    "insert.mode": "insert",
    "transforms": "dropPrefix,ExtractField,ReplaceField",
    "transforms.dropPrefix.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.dropPrefix.regex": "acm.(acm.*)",
    "transforms.dropPrefix.replacement": "$1",
    "transforms.ExtractField.type": "org.apache.kafka.connect.transforms.ExtractField$Value",
    "transforms.ExtractField.field": "after",
    "transforms.ReplaceField.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
    "transforms.ReplaceField.blacklist": "operator,last_modified_by"
  }
}

Once the steps above are complete and everything is healthy, the acm database in MariaDB is replicated to PostgreSQL in real time.

If anything errors out, check the Kafka Connect logs.
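
Besides the logs, the Connect REST API exposes per-task status; a minimal sketch of checking it, using the connector name and default port from above:

import requests  # assumption: the requests package is installed

STATUS_URL = "http://127.0.0.1:8083/connectors/test-xxx-sink/status"

status = requests.get(STATUS_URL).json()
print(status["connector"]["state"])  # RUNNING / PAUSED / FAILED
for task in status["tasks"]:
    # a failed task carries a "trace" field with the full stack trace
    print(task["id"], task["state"], task.get("trace", ""))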

4. Custom Kafka Parsing

The plugin-based approach has obvious advantages: zero code and only a bit of configuration before it works. However, its support for source-database events is not very rich, so the solution I ultimately chose is to consume and parse the Kafka events myself on the target side; a sketch follows below.
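
As an illustration of that approach, here is a minimal sketch of a target-side consumer, assuming the kafka-python and psycopg2 packages, JSON values without a schema envelope, and a single hypothetical table demo with columns id and name:

import json

import psycopg2                   # assumption: psycopg2 is installed
from kafka import KafkaConsumer   # assumption: kafka-python is installed

consumer = KafkaConsumer(
    "acm.acm.demo",                        # hypothetical Debezium topic from the source above
    bootstrap_servers="127.0.0.1:9092",
    group_id="acm-sync",
    auto_offset_reset="earliest",
)
conn = psycopg2.connect("dbname=acm user=xxx password=xxx host=IP")

for msg in consumer:
    if msg.value is None:                  # tombstone after a delete; nothing to parse
        continue
    event = json.loads(msg.value.decode("utf-8"))
    payload = event.get("payload", event)  # handle values with or without a schema envelope
    op, before, after = payload["op"], payload["before"], payload["after"]
    with conn.cursor() as cur:
        if op in ("c", "r"):               # insert, or row read during the initial snapshot
            cur.execute("INSERT INTO demo (id, name) VALUES (%s, %s)",
                        (after["id"], after["name"]))
        elif op == "u":                    # update
            cur.execute("UPDATE demo SET name = %s WHERE id = %s",
                        (after["name"], after["id"]))
        elif op == "d":                    # delete: only "before" is populated
            cur.execute("DELETE FROM demo WHERE id = %s", (before["id"],))
    conn.commit()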
