Using Canal 1.1.5 for real-time MySQL-to-ClickHouse sync (RabbitMQ edition)


ClickHouse is widely used as an OLAP analytics engine, and getting data in and out is the first problem users face. Because ClickHouse does not cope well with large numbers of small single-row writes, real-time replication needs help from other services. This article presents a Canal + RabbitMQ approach, and shows how, in a sharded scenario across multiple MySQL instances, several MySQL tables can be written into a single ClickHouse table. Feedback and corrections are welcome.

I previously wrote about using Canal to sync MySQL data into ClickHouse in real time, but that setup had a small problem: it only synced a single table. This post fixes that bug and also integrates RabbitMQ as the message queue.

Previous post: 使用Canal实现ClickHouse实时同步MySQL数据

My requirement is to sync the data of several MySQL tables into ClickHouse in real time. I went through a lot of material online; much of it is not written well, and some of the configurations error out as soon as you finish them. I finally got this working today, so I am sharing it here.

I mainly use canal to replicate MySQL data into ClickHouse in real time, with RabbitMQ as the message queue (canal 1.1.5 added RabbitMQ support): the canal deployer parses the MySQL binlog and publishes change events to RabbitMQ, and the canal adapter consumes them from the queue and writes them into ClickHouse.

  • MySQL configuration: covered in my previous post
  • RabbitMQ configuration (install RabbitMQ yourself)
  • Canal installation and configuration
  • canal-client (adapter) installation and configuration
  • Creating the ClickHouse database and table
  • Results

1. MySQL configuration: covered in my previous post

Previous post: 使用Canal实现ClickHouse实时同步MySQL数据

2. RabbitMQ configuration

Create an exchange (here: ck-ex, matching the canal.properties config below).

Create a queue (here: ck-queue).

Bind the queue to the exchange with the routing key example-routingkey.
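The same setup can be scripted with rabbitmqadmin instead of the management UI. A minimal sketch; the exchange type is my assumption (direct suffices for a fixed routing key):

# create the exchange canal will publish to (type=direct is an assumption)
rabbitmqadmin declare exchange name=ck-ex type=direct durable=true

# create the queue the adapter will consume from
rabbitmqadmin declare queue name=ck-queue durable=true

# bind them with the routing key that canal.mq.topic will use
rabbitmqadmin declare binding source=ck-ex destination=ck-queue routing_key=example-routingkey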

3. Canal installation and configuration

Download canal from the releases page and pick the package you need; this guide uses version 1.1.5.

# download the canal deployer package
wget https://github.com/alibaba/canal/releases/download/canal-1.1.5/canal.deployer-1.1.5.tar.gz

# extract to a target directory
mkdir -p /usr/tool/canal
tar zxvf canal.deployer-1.1.5.tar.gz -C /usr/tool/canal

# after extraction, the directory layout looks like this
drwxr-xr-x   7 awwzc  staff   238 12 14 23:34 bin
drwxr-xr-x   9 awwzc  staff   306 12 14 23:32 conf
drwxr-xr-x  83 awwzc  staff  2822 12 14 23:30 lib
drwxr-xr-x   4 awwzc  staff   136 12 14 23:34 logs

# edit conf/canal.properties (this must be changed)
canal.serverMode = rabbitmq
canal.mq.servers = xxx # RabbitMQ host/IP
canal.mq.vhost=/
canal.mq.exchange=ck-ex # the exchange created above
canal.mq.username=xxx
canal.mq.password=xxx
canal.mq.aliyunuid=
# note: some 1.1.5 builds name these keys rabbitmq.host, rabbitmq.virtual.host,
# rabbitmq.exchange, rabbitmq.username, rabbitmq.password instead;
# use whichever names appear in your conf/canal.properties template


# edit conf/example/instance.properties and change the items below to your own database settings
vi conf/example/instance.properties

# position info
canal.instance.master.address=192.168.0.102:3306
# filter which MySQL tables to capture (here: every shard tb_aqi_* in database aqi_china)
canal.instance.filter.regex=aqi_china\\.tb_aqi_.*
# routing key used when publishing to the exchange; must match the queue binding
canal.mq.topic=example-routingkey

# start canal from the install directory; it has started successfully once the server and instance logs below appear
bin/startup.sh
# watch the server log; you should see the following
tail -200f logs/canal/canal.log

2019-12-14 23:34:47.247 [main] INFO  com.alibaba.otter.canal.deployer.CanalLauncher - ## set default uncaught exception handler
2019-12-14 23:34:47.312 [main] INFO  com.alibaba.otter.canal.deployer.CanalLauncher - ## load canal configurations
2019-12-14 23:34:47.334 [main] INFO  com.alibaba.otter.canal.deployer.CanalStarter - ## start the canal server.
2019-12-14 23:34:47.406 [main] INFO  com.alibaba.otter.canal.deployer.CanalController - ## start the canal server[192.168.0.111(192.168.0.111):11111]
2019-12-14 23:34:49.026 [main] INFO  com.alibaba.otter.canal.deployer.CanalStarter - ## the canal server is running now ......

# watch the instance log; you should see the following
tail -200f logs/example/example.log

2019-12-15 17:59:12.908 [destination = example , address = /192.168.0.102:3306 , EventParser] WARN  c.a.o.c.p.inbound.mysql.rds.RdsBinlogEventParserProxy - ---> begin to find start position, it will be long time for reset or first position
2019-12-15 17:59:12.913 [destination = example , address = /192.168.0.102:3306 , EventParser] WARN  c.a.o.c.p.inbound.mysql.rds.RdsBinlogEventParserProxy - prepare to find start position just last position
 {"identity":{"slaveId":-1,"sourceAddress":{"address":"192.168.0.102","port":3306}},"postion":{"gtid":"","included":false,"journalName":"bin.000002","position":249315,"serverId":1,"timestamp":1576282583000}}
2019-12-15 17:59:13.015 [destination = example , address = /192.168.0.102:3306 , EventParser] WARN  c.a.o.c.p.inbound.mysql.rds.RdsBinlogEventParserProxy - ---> find start position successfully, EntryPosition[included=false,journalName=bin.000002,position=249315,serverId=1,gtid=,timestamp=1576282583000] cost : 105ms , the next step is binlog dump


# shut down
sh bin/stop.sh
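Before installing the adapter, it is worth confirming that binlog events actually reach RabbitMQ. A sketch, assuming rabbitmqadmin and the names above (flag spellings vary slightly across rabbitmqadmin versions): modify a row in any tb_aqi_* table, then peek at the queue without consuming the message:

# peek at the queue; requeue so the adapter can still consume the message later
rabbitmqadmin get queue=ck-queue ackmode=ack_requeue_true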

4. canal-client (adapter) installation and configuration

Download the canal adapter from the same releases page, again using version 1.1.5.

# download the canal adapter package
wget https://github.com/alibaba/canal/releases/download/canal-1.1.5/canal.adapter-1.1.5.tar.gz

# extract to a target directory
mkdir -p /usr/tool/canal-client
tar zxvf canal.adapter-1.1.5.tar.gz -C /usr/tool/canal-client

# after extraction, the directory layout looks like this
drwxr-xr-x 2 root root        95 Jun  3 14:26 bin
-rw-r--r-- 1 root root 197350026 Jun  1 18:40 canal.adapter-1.1.5.tar.gz
drwxrwxrwx 8 root root       143 Jun  4 16:57 conf
drwxr-xr-x 2 root root      4096 Jun  2 17:02 lib
drwxrwxrwx 3 root root        21 Jun  1 19:01 logs
drwxrwxrwx 2 root root      4096 Apr 19 16:16 plugin

# in conf/application.yml, set the mode and configure rabbitmq
canal.conf:
  mode: rabbitMQ # tcp kafka rocketMQ rabbitMQ (case does not matter)
  ...
  rabbitmq.host: xxx:5672
  rabbitmq.virtual.host: /
  rabbitmq.username: xxxx
  rabbitmq.password: xxxx
  rabbitmq.resource.ownerId:
  
canalAdapters:
  - instance: ck-queue # the RabbitMQ queue to consume from
    groups:
    - groupId: g1
      outerAdapters:
      - name: logger
      - name: rdb
        key: mysql1
        properties:
          jdbc.driverClassName: ru.yandex.clickhouse.ClickHouseDriver
          jdbc.url: jdbc:clickhouse://xxx:8123/aqi_china # your ClickHouse JDBC URL
          jdbc.username: default
          jdbc.password:
           
# go to the conf/rdb directory
# for multiple tables: I have ten shard tables, so I create ten config files, one per source table

# the source database name differs from the target database name, and the source table names differ from the target table name
# edit the adapter's conf/rdb/mytest_user.yml to specify the source and target
dataSourceKey: defaultDS
destination: ck-queue # must match the instance (queue) name in application.yml
groupId: g1
outerAdapterKey: mysql1
concurrent: true
dbMapping:
  database: aqi_china
  table: tb_aqi_0 # the part that changes per file: tb_aqi_1, tb_aqi_2, ...
  targetTable: tb_aqi
  #targetPk:
  #  id: id
  mapAll: true
  #targetColumns:
  #  id:
  #  name:
  #  role_id:
  #  c_time:
  #  test1:
  #etlCondition: "where c_time>={}"
  commitBatch: 3000 # batch commit size
    
# multiple source tables writing into the same target table
Put multiple yml files in the conf/rdb directory, each naming a different source table, as sketched below.
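Rather than hand-editing ten copies, the per-shard files can be generated from the first one. A sketch, assuming you save the config above as conf/rdb/tb_aqi_0.yml (the stock file is mytest_user.yml):

# generate tb_aqi_1.yml .. tb_aqi_9.yml from the tb_aqi_0 template
cd /usr/tool/canal-client/conf/rdb
for i in $(seq 1 9); do
  sed "s/table: tb_aqi_0/table: tb_aqi_$i/" tb_aqi_0.yml > tb_aqi_$i.yml
done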
   
One point worth stressing: the following jars must be added to the adapter's lib directory (the adapter does not ship the ClickHouse JDBC driver or its dependencies):
  clickhouse-jdbc-0.2.jar
  httpclient-4.5.5.jar
  httpcore-4.4.9.jar
  lz4-1.3.0.jar
  lz4-java-1.4.1.jar
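All five jars are available on Maven Central if you need to fetch them; the paths below assume the usual coordinates for these artifacts, so verify against the versions listed above:

cd /usr/tool/canal-client/lib
wget https://repo1.maven.org/maven2/ru/yandex/clickhouse/clickhouse-jdbc/0.2/clickhouse-jdbc-0.2.jar
wget https://repo1.maven.org/maven2/org/apache/httpcomponents/httpclient/4.5.5/httpclient-4.5.5.jar
wget https://repo1.maven.org/maven2/org/apache/httpcomponents/httpcore/4.4.9/httpcore-4.4.9.jar
wget https://repo1.maven.org/maven2/net/jpountz/lz4/lz4/1.3.0/lz4-1.3.0.jar
wget https://repo1.maven.org/maven2/org/lz4/lz4-java/1.4.1/lz4-java-1.4.1.jar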
  

Start the adapter:

bin/startup.sh

# watch the log; the lines below indicate a successful start
tail -200f logs/adapter/adapter.log

I hit an OutOfMemoryError on startup; it can be fixed by adjusting the JVM settings in startup.sh, as sketched below.
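A minimal sketch of that fix: the heap is set via JAVA_OPTS in bin/startup.sh, and the exact default flags vary by release, so treat the values here as illustrative and adjust rather than copy:

# in bin/startup.sh, tune the heap flags inside JAVA_OPTS, e.g.
JAVA_OPTS="-server -Xms1024m -Xmx2048m"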

Once it starts successfully, the log prints:

2019-12-15 13:19:40.457 [main] INFO  c.a.o.canal.adapter.launcher.loader.CanalAdapterService - ## start the canal client adapters.
2019-12-15 13:19:40.464 [main] INFO  c.a.otter.canal.client.adapter.support.ExtensionLoader - extension classpath dir: /Users/awwzc/Documents/my_soft/tool/canal-adpater/plugin
2019-12-15 13:19:40.542 [main] INFO  c.a.o.canal.adapter.launcher.loader.CanalAdapterLoader - Load canal adapter: logger succeed
2019-12-15 13:19:40.546 [main] INFO  c.a.otter.canal.client.adapter.rdb.config.ConfigLoader - ## Start loading rdb mapping config ...
2019-12-15 13:19:40.637 [main] INFO  c.a.otter.canal.client.adapter.rdb.config.ConfigLoader - ## Rdb mapping config loaded
2019-12-15 13:19:40.640 [main] ERROR com.alibaba.druid.pool.DruidDataSource - testWhileIdle is true, validationQuery not set
2019-12-15 13:19:40.951 [main] INFO  com.alibaba.druid.pool.DruidDataSource - {dataSource-2} inited
2019-12-15 13:19:40.959 [main] INFO  c.a.o.canal.adapter.launcher.loader.CanalAdapterLoader - Load canal adapter: rdb succeed
2019-12-15 13:19:40.986 [main] INFO  c.a.o.canal.adapter.launcher.loader.CanalAdapterLoader - Start adapter for canal instance: example succeed
2019-12-15 13:19:40.986 [Thread-4] INFO  c.a.o.canal.adapter.launcher.loader.CanalAdapterWorker - =============> Start to connect destination: example <=============
2019-12-15 13:19:40.986 [main] INFO  c.a.o.canal.adapter.launcher.loader.CanalAdapterService - ## the canal client adapters are running now ......
2019-12-15 13:19:40.995 [main] INFO  org.apache.coyote.http11.Http11NioProtocol - Starting ProtocolHandler ["http-nio-8081"]
2019-12-15 13:19:41.021 [main] INFO  org.apache.tomcat.util.net.NioSelectorPool - Using a shared selector for servlet write/read
2019-12-15 13:19:41.048 [main] INFO  o.s.boot.web.embedded.tomcat.TomcatWebServer - Tomcat started on port(s): 8081 (http) with context path ''
2019-12-15 13:19:41.053 [main] INFO  c.a.otter.canal.adapter.launcher.CanalAdapterApplication - Started CanalAdapterApplication in 7.099 seconds (JVM running for 7.832)
2019-12-15 13:19:41.122 [Thread-4] INFO  c.a.o.canal.adapter.launcher.loader.CanalAdapterWorker - =============> Start to subscribe destination: example <=============
2019-12-15 13:19:41.128 [Thread-4] INFO  c.a.o.canal.adapter.launcher.loader.CanalAdapterWorker - =============> Subscribe destination: example succeed <=============
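The adapter also starts an embedded Tomcat (port 8081 in the log above) with a small REST interface. As a hedged sanity check, assuming the default port, you can list the configured destinations and their sync state:

# query the adapter's REST endpoint for destination status
curl http://127.0.0.1:8081/destinations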

5. Creating the ClickHouse database and table

# create the database
CREATE DATABASE aqi_china;

# create the table
CREATE TABLE IF NOT EXISTS aqi_china.tb_aqi (
    id Int32, uuid String, uid Int32, aqi Int32, url String,
    co String, dew String, h String, no2 String, o3 String,
    p String, pm10 String, pm25 String, so2 String, t String,
    w String, wg String, vtime String, ftime Int32
) ENGINE = MergeTree() PARTITION BY uid ORDER BY id SETTINGS index_granularity = 8192;

6. Results

Let's look at the result.

The canal client (adapter) log prints each consumed DML record (screenshot omitted).

The data in ClickHouse (screenshot omitted):
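A quick way to verify from the command line (a sketch using clickhouse-client):

# row count plus a sample of the most recent rows
clickhouse-client --query "SELECT count() FROM aqi_china.tb_aqi"
clickhouse-client --query "SELECT * FROM aqi_china.tb_aqi ORDER BY id DESC LIMIT 5"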

The data has been synced successfully.

I hope this helps anyone who has run into the same problems. This approach works reliably for me; if you have any issues, leave me a comment.