Logstash-Jdbc同步到 elasticsearch

mysql 作为成熟稳定的数据持久化解决方案，广泛地应用在各种领域，但是在数据分析方面稍有不足，而 elasticsearch 作为数据分析领域的佼佼者，刚好可以弥补这项不足，而我们要做的只需要将 mysql 中的数据同步到 elasticsearch 中即可，而 logstash 刚好就可以支持，所有你需要做的只是写一个配置文件而已。

获取Logstash-Jdbc

下载与Logstash相对应的版本号

Logstash-Jdbc插件安装

[root@elk-node1 tmp]# tar -zxvf logstash-input-jdbc-4.3.18.tar.gz
[root@elk-node1 tmp]# mv logstash-input-jdbc-4.3.18 /ELK/
[root@elk-node1 ELK]# cd  /ELK/logstash-input-jdbc-4.3.18
[root@elk-node1]# gem build logstash-input-jdbc.gemspec
[root@elk-node1 ELK]# cd /ELK/logstash/bin/
[root@elk-node1 bin]# ./logstash-plugin install /ELK/logstash-input-jdbc-4.3.18/logstash-input-jdbc-4.3.18.gem

Example config

input{
jdbc{
type => "jdbc"      #连接类型#连接类型

jdbc_driver_library => "/ELK/logstash/tools/mysql-connector-java-5.1.48.jar"    #jdbc驱动包

jdbc_driver_class => "com.mysql.jdbc.Driver"    # the name of the driver class for mysql

jdbc_connection_string => "jdbc:mysql://127.0.0.1:3306/DBname"  # 数据库连接地址

jdbc_user => "root"     # 数据库连接账号

jdbc_password => "123456"       # 数据库连接密码

schedule => "* * * * *"     #执行 sql 时机，类似 crontab 的调度

connection_retry_attempts => "3"        # 数据库重连尝试次数

jdbc_validate_connection => "true"      # 判断数据库连接是否可用，默认false不开启

jdbc_validation_timeout => "3600"	    # 数据库连接可用校验超时时间，默认3600S

jdbc_paging_enabled => "true"           # 开启分页查询（默认false不开启）

jdbc_page_size => "500"                 #单次分页查询条数（默认100000,若字段较多且更新频率较高，建议调低此值）

statement => "select * from log_company_operation "WHERE platform>= :sql_last_value order by platform asc
# statement为查询数据sql，如果sql较复杂，建议配通过statement_filepath配置sql文件的存放路径；
# sql_last_value为内置的变量，存放上次查询结果中最后一条数据tracking_column的值，此处即为log_company_operat；
# statement_filepath => "mysql/jdbc.sql"

lowercase_column_names => false     #是否将字段名转换为小写，默认true（如果有数据序列化、反序列化需求，建议改为false）；

sql_log_level => warn       # Value can be any of: fatal,error,warn,info,debug，默认info；

record_last_run => true     # 是否记录上次执行结果，true表示会将上次执行结果的tracking_column字段的值保存到last_run_metadata_path指定的文件中

use_column_value => true        # 需要记录查询结果某字段的值时，此字段为true，否则默认tracking_column为timestamp的值；

tracking_column => "ModifyTime"     # 需要记录的字段，用于增量同步，需是数据库字段

tracking_column_type => timestamp       # Value can be any of: numeric,timestamp，Default value is "numeric"

last_run_metadata_path => "mysql/last_id.txt"       # record_last_run上次数据存放位置；

clean_run => false      # 是否清除last_run_metadata_path的记录，需要增量同步时此字段必须为false；

use_column_value => true
tracking_column_type => "timestamp"
tracking_column => "update_time"
use_column_value: 使用递增列的值
tracking_column_type: 递增字段的类型，numeric 表示数值类型, timestamp 表示时间戳类型
tracking_column: 递增字段的名称，这里使用 update_time 这一列，这列的类型是 timestamp
 }
}

output {
    elasticsearch {
        hosts => ["172.18.12.172:9200"]
        index => "jdbc-%{+YYYY.MM.dd}"
        document_id => "%{id}"
        document_type => "user"
    }
    stdout {
        codec => json_lines
    }
}

hosts: es 集群地址
user: es 用户名
password: es 密码
index: 导入到 es 中的 index 名，这里我直接设置成了 mysql 表的名字
document_id: 导入到 es 中的文档 id，这个需要设置成主键，否则同一条记录更新后在 es 中会出现两条记录，%{id} 表示引用 mysql 表中 id 字段的值

多表同步的两种方式：

一、创建多个XXX.conf文件配置pipelines.yml

[root@elk-node1 config]# vim /ELK/logstash/config/pipelines.yml

配置如下：

- pipeline.id: table1
  path.config: "/XXX/XX/X/mysql_user.conf"
- pipeline.id: table2
  path.config: "/XXX/XX/X/mysql_member.conf"

二、通过一个.conf文件同步

input {
	stdin {}
	jdbc {
		 # 多表同步时，表类型区分，建议命名为“库名_表名”，每个jdbc模块需对应一个type；
		type => "TestDB_DetailTab"
		
		 # 其他配置此处省略，参考单表配置
		 # ...
		 # ...
		 # record_last_run上次数据存放位置；
		last_run_metadata_path => "mysql\last_id.txt"
		 # 是否清除last_run_metadata_path的记录，需要增量同步时此字段必须为false；
		clean_run => false
		 #
		 # 同步频率(分 时 天 月 年)，默认每分钟同步一次；
		schedule => "* * * * *"
	}
	jdbc {
		 # 多表同步时，表类型区分，建议命名为“库名_表名”，每个jdbc模块需对应一个type；
		type => "TestDB_Tab2"
		# 多表同步时，last_run_metadata_path配置的路径应不一致，避免有影响；
		 # 其他配置此处省略
		 # ...
		 # ...
	}
}
 
 
output {
	# output模块的type需和jdbc模块的type一致
	if [type] == "TestDB_DetailTab" {
		elasticsearch {
			 # host => "192.168.1.1"
			 # port => "9200"
			 # 配置ES集群地址
			hosts => ["192.168.1.1:9200", "192.168.1.2:9200", "192.168.1.3:9200"]
			 # 索引名字，必须小写
			index => "detailtab1"
			 # 数据唯一索引（建议使用数据库KeyID）
			document_id => "%{KeyId}"
		}
	}
	if [type] == "TestDB_Tab2" {
		elasticsearch {
			# host => "192.168.1.1"
			# port => "9200"
			# 配置ES集群地址
			hosts => ["192.168.1.1:9200", "192.168.1.2:9200", "192.168.1.3:9200"]
			# 索引名字，必须小写
			index => "detailtab2"
			# 数据唯一索引（建议使用数据库KeyID）
			document_id => "%{KeyId}"
		}
	}
	stdout {
		codec => json_lines
	}
}

常见问题及解决方案

一、数据同步后，ES没有数据

output.elasticsearch模块的index必须是全小写

二、增量同步后last_run_metadata_path文件内容不改变

如果lowercase_column_names配置的不是false，那么tracking_column字段配置的必须是全小写。

三、提示找不到jdbc_driver_library

2032 com.mysql.jdbc.Driver not loaded. Are you sure you've included the correct jdbc driver in:jdbc_driver_library?

检测配置的地址是否正确，如果是linux环境，注意路径分隔符是“/”，而不是“\”。
检测插件版本与软件版本是否相对应

四、数据丢失

statement配置的sql中，如果比较字段使用的是大于“>”，可能存在数据丢失。

假设当同步完成后last_run_metadata_path存放的时间为2019-01-30 20:45:30，而这时候新入库一条数据的更新时间也为2019-01-30 20:45:30，那么这条数据将无法同步。

五、数据重复更新

上一个问题“数据丢失”提供的解决方案是比较字段使用“大于等于”，但这时又会产生新的问题。假设当同步完成后last_run_metadata_path存放的时间为2019-01-30 20:45:30，而数据库中更新时间最大值也为2019-01-30 20:45:30，那么这些数据将重复更新，直到有更新时间更大的数据出现。当上述特殊数据很多，且长期没有新的数据更新时，会导致大量的数据重复同步到ES。何时会出现以上情况呢：①比较字段非"自增"；②比较字段是程序生成插入。

解决方案：
1、比较字段自增保证不重复或重复概率极小（比如使用自增ID或者数据库的timestamp），这样就能避免大部分异常情况了；

2、如果确实存在大量程序插入的数据，其更新时间相同，且可能长期无数据更新，可考虑定期更新数据库中的一条测试数据，避免最大值有大量数据。

六、LogStash 错误：Logstash could not be started because there is already another instance usin

错误提示：

    Sending Logstash logs to /usr/local/logstash/logstash-6.5.0/logs which is now configured via log4j2.properties[2018-11-20T12:23:45,931][WARN ][logstash.config.source.multilocal] Ignoring the 'pipelines.yml' file because modules or command line options are specified[2018-11-20T12:23:46,088][FATAL][logstash.runner          ] Logstash could not be started because there is already another instance using the configured data directory.  If you wish to run multiple instances, you must change the "path.data" setting.[2018-11-20T12:23:46,130][ERROR][org.logstash.Logstash    ] java.lang.IllegalStateException: Logstash stopped processing because of an error: (SystemExit) exit

原因：之前运行的instance有缓冲，保存在path.data里面有.lock文件，删除掉就可以。

解决方式：

在 logstash.yml 文件中找到 Data path 的路径(默认在安装目录的data目录下）
查看是否存在 .lock 文件，如果存在把它删除

[root@elk-node1]# ls -alh
总用量 20K
drwxr-sr-x  4 tingshuo staff 4.0K 11月 20 11:42 .
drwxr-sr-x 14 tingshuo staff 4.0K 11月 20 11:42 ..
drwxr-sr-x  2 tingshuo staff 4.0K 11月 20 11:42 dead_letter_queue-rw-r--r--  1 tingshuo staff    0 11月 20 11:42  .lock
drwxr-sr-x  2 tingshuo staff 4.0K 11月 20 11:42 queue-rw-r--r--              1 tingshuo staff   36 11月 20 11:42  uuid
[root@elk-node1]# rm  -f .lock

使用Logstash-input-jdbc插件抽取Mysql库中的数据到ELK中