1. Introduction
Kafka Connect's plugin ecosystem is powerful and widely used in real-time data warehousing. Below is a small example of setting up real-time data migration and synchronization on top of Connect plugins, with zero code written.
The goal of the example is to replicate a MariaDB database into PostgreSQL in real time.
2. The Source Side
Download the debezium-connector-mysql plugin: Debezium plugin 1.9. (Debezium has no separate MariaDB connector here; its MySQL connector also works against MariaDB.)
Send the following POST request to the Kafka Connect REST API to register the source connector:
{
  "name": "mariadb-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "tasks.max": "1",
    "database.hostname": "MariaDB host IP",
    "database.port": "3306",
    "database.user": "MariaDB user",
    "database.password": "MariaDB password",
    "database.server.id": "a random number, e.g. 184054",
    "database.server.name": "acm",
    "database.include.list": "acm",
    "database.history.kafka.bootstrap.servers": "127.0.0.1:9092",
    "database.history.kafka.topic": "schema-changes.acm",
    "time.precision.mode": "connect"
  }
}
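For reference, here is one way to submit this with curl, assuming the Connect worker's REST API is on its default port 8083 and the JSON above is saved as mariadb-source.json (both are assumptions about your deployment):

# Register the Debezium source connector with the Connect worker.
# Assumes the REST API at localhost:8083 (the default) and the
# payload above saved as mariadb-source.json.
curl -X POST -H "Content-Type: application/json" \
     --data @mariadb-source.json \
     http://localhost:8083/connectors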
3. The Sink Side
Debezium only provides source connectors. For the sink side there is no getting around Confluent, the commercial company that Kafka's original creators went off to found, which offers many official and third-party connectors for download: the Confluent Hub.
The Confluent Hub actually carries the Debezium connectors too, so everything here can be downloaded from one place.
Here we choose the JDBC sink connector.
Once downloaded, deploy it the usual way; the deployment steps are covered in my previous article.
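As a sketch of an alternative, if the confluent-hub CLI is installed, the plugin can also be pulled down in one command (the component coordinates below are the JDBC connector's on Confluent Hub; pin a specific version if you prefer):

# Download the JDBC connector into the Connect plugin path.
confluent-hub install confluentinc/kafka-connect-jdbc:latest
# Restart the Connect worker afterwards so it discovers the new plugin.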
Send one of the following POST requests to register the sink connector; three variants are shown:
-- Example 1: no topic renaming, so the sink derives each table name from the full topic name (acm.acm.xxx by default)
{
  "name": "test-xxx-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "tasks.max": "1",
    "topics.regex": "acm.acm.(.*)",
    "connection.url": "jdbc:postgresql://IP:port/acm?user=xxx&password=xxx",
    "dialect.name": "PostgreSqlDatabaseDialect",
    "auto.create": true,
    "auto.evolve": true,
    "insert.mode": "insert",
    "transforms": "ExtractField,ReplaceField",
    "transforms.ExtractField.type": "org.apache.kafka.connect.transforms.ExtractField$Value",
    "transforms.ExtractField.field": "after",
    "transforms.ReplaceField.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
    "transforms.ReplaceField.blacklist": "operator,last_modified_by",
    "db.timezone": "UTC"
  }
}
-- Example 2: adds a RegexRouter transform that strips the leading server-name prefix from the topic before it is mapped to a table name
{
  "name": "test-xxx-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "tasks.max": "1",
    "topics.regex": "acm.acm.(.*)",
    "connection.url": "jdbc:postgresql://IP:port/acm?user=xxx&password=xxx",
    "dialect.name": "PostgreSqlDatabaseDialect",
    "auto.create": true,
    "auto.evolve": true,
    "insert.mode": "insert",
    "transforms": "dropPrefix,ExtractField,ReplaceField",
    "transforms.dropPrefix.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.dropPrefix.regex": "acm.(acm.*)",
    "transforms.dropPrefix.replacement": "$1",
    "transforms.ExtractField.type": "org.apache.kafka.connect.transforms.ExtractField$Value",
    "transforms.ExtractField.field": "after",
    "transforms.ReplaceField.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
    "transforms.ReplaceField.blacklist": "operator,last_modified_by",
    "db.timezone": "UTC"
  }
}
-- Example 3: same as Example 2, but without pinning db.timezone
{
  "name": "test-xxx-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "tasks.max": "1",
    "topics.regex": "acm.acm.(.*)",
    "dialect.name": "PostgreSqlDatabaseDialect",
    "connection.url": "jdbc:postgresql://IP:port/acm?user=xxx&password=xxx",
    "auto.create": true,
    "auto.evolve": true,
    "insert.mode": "insert",
    "transforms": "dropPrefix,ExtractField,ReplaceField",
    "transforms.dropPrefix.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.dropPrefix.regex": "acm.(acm.*)",
    "transforms.dropPrefix.replacement": "$1",
    "transforms.ExtractField.type": "org.apache.kafka.connect.transforms.ExtractField$Value",
    "transforms.ExtractField.field": "after",
    "transforms.ReplaceField.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
    "transforms.ReplaceField.blacklist": "operator,last_modified_by"
  }
}
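The sink config is submitted exactly like the source config; afterwards the worker can be asked for the list of registered connectors. The file name pg-sink.json and the host/port below are assumptions:

# Register the sink (JSON from one of the examples above, saved as
# pg-sink.json -- a placeholder name), then list all connectors.
curl -X POST -H "Content-Type: application/json" \
     --data @pg-sink.json \
     http://localhost:8083/connectors
curl http://localhost:8083/connectors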
Once all of the steps above complete without errors, the acm database in MariaDB is replicated into PostgreSQL in real time.
If anything reports an error, check the kafka-connect logs.
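Before digging through the logs, the REST API's status endpoint is a quick first check; a failed task reports its stack trace in the response's trace field (the connector names are the ones registered above):

# Query the running state of both connectors.
curl http://localhost:8083/connectors/mariadb-connector/status
curl http://localhost:8083/connectors/test-xxx-sink/status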
4. Parsing Kafka Yourself
The advantage of the pure-plugin approach is obvious: zero code, a little configuration, and it works. But its support for source-side database events is not very rich, so the solution I ultimately settled on was to consume and parse the Kafka messages myself on the target side.
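Whatever the custom consumer ends up looking like, its input is the raw Debezium change event on each table's topic. A quick way to inspect that envelope is Kafka's console consumer; some_table below is a placeholder for a real table in the acm database:

# Dump raw change events; each record carries "before", "after", "op"
# and "source" fields for a custom parser to work with.
kafka-console-consumer.sh --bootstrap-server 127.0.0.1:9092 \
    --topic acm.acm.some_table --from-beginning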