Data Synchronization
Use case: synchronizing data between MySQL and Hadoop (Hive/HDFS)
1. Sqoop
I. Installation and Deployment
- Download Sqoop 1.4.7 and extract it
- Configure the environment variables
- Edit conf/sqoop-env.sh and point it at the Hadoop installation
- The stock 1.4.7 release has a dependency problem: it ships commons-lang3-3.4.jar, but Sqoop actually needs commons-lang 2.x. Download the 2.x jar from Maven and put it into lib/, and add the MySQL connector jar as well (see the sketch right after this list).
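A minimal sketch of the jar fix plus a smoke test, assuming Sqoop lives under /opt/module/sqoop-1.4.7 and the two jars have already been downloaded into the current directory (paths and jar versions are assumptions):
# drop the required jars into Sqoop's lib directory (paths/versions are assumptions)
cp commons-lang-2.6.jar /opt/module/sqoop-1.4.7/lib/
cp mysql-connector-java-5.1.47.jar /opt/module/sqoop-1.4.7/lib/
# smoke test: print the version and list the MySQL databases
sqoop version
sqoop list-databases \
--connect jdbc:mysql://bigdata31:3306 \
--username root \
--password 123456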
II. Parameters
--connect            JDBC URL of the MySQL database
--username           MySQL user name
--password           MySQL password
--table              table to import
--target-dir         target HDFS path
-m                   number of map tasks
--delete-target-dir  delete the target path before importing
sqoop import \
--connect jdbc:mysql://bigdata31:3306/hive \
--username root \
--password 123456 \
--table tbls \
--target-dir /data/tbls \
--delete-target-dir \
-m 1
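A quick way to verify the import (the part file name assumes the default naming of a single-map job):
hdfs dfs -ls /data/tbls
hdfs dfs -cat /data/tbls/part-m-00000 | head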
Note: what if the MySQL table has no primary key?
The import fails with an error like:
ERROR tool.ImportTool: Import failed: No primary key could be found for table b. Please specify one with --split-by or perform a sequential import with '-m 1'.
By default Sqoop splits the table into 4 map tasks based on the primary key, so if there is no primary key there are two fixes:
- --split-by "column": split on the specified column
- -m 1: do not split at all, use a single map task
--where    filter rows
--columns  select specific columns
--query    custom SQL for the extraction (must contain $CONDITIONS)
sqoop import \
--connect jdbc:mysql://bigdata31:3306/bigdata \
--username root \
--password 123456 \
--target-dir /data/b \
--delete-target-dir \
--split-by id \
-m 2 \
--query "select * from bigdata.b where \$CONDITIONS" \
--fields-terminated-by "\t"
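--where and --columns can be combined with --table instead of writing a full --query; a minimal sketch against the same bigdata.b table (the column names and filter are assumptions):
sqoop import \
--connect jdbc:mysql://bigdata31:3306/bigdata \
--username root \
--password 123456 \
--table b \
--columns "id,name" \
--where "id > 100" \
--target-dir /data/b_filtered \
--delete-target-dir \
-m 1 \
--fields-terminated-by "\t"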
III. MySQL to Hive
1. Create the Hive table first, then use Sqoop to load the data into the table's HDFS directory
Non-partitioned table
- Create the Hive table
CREATE TABLE base_category2 (
  id bigint,
  name varchar(200),
  category1_id bigint
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
- Import into the table directory with Sqoop
sqoop import \
--connect jdbc:mysql://hadoop102:3306/gmall \
--username root \
--password 000000 \
--target-dir /user/hive/warehouse/bigdata.db/base_category2 \
--delete-target-dir \
-m 1 \
--query "select * from gmall.base_category2 where \$CONDITIONS" \
--fields-terminated-by "\t"
- Query it in Hive
select * from base_category2;
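A simple sanity check is to compare row counts on both sides (a sketch; the MySQL host and credentials are the ones used above):
-- in Hive
select count(*) from base_category2;
# in MySQL, from the gateway host
mysql -h hadoop102 -uroot -p000000 -N -e "select count(*) from gmall.base_category2"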
Partitioned table
- Create the Hive table
CREATE TABLE base_category2_p2 (
  id bigint,
  name varchar(200)
) PARTITIONED BY (category1_id bigint)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
- Add a partition
alter table base_category2_p2 add partition(category1_id=1);
- Import a single partition with Sqoop, using a where filter
sqoop import \
--connect jdbc:mysql://hadoop102:3306/gmall \
--username root \
--password 000000 \
--target-dir /user/hive/warehouse/bigdata.db/base_category2_p2/category1_id=1 \
--delete-target-dir \
-m 1 \
--query "select * from gmall.base_category2 where category1_id = 1 and \$CONDITIONS" \
--fields-terminated-by "\t"
- Query it in Hive
select * from base_category2_p2;
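To load more than one partition, the add-partition and import steps above can be wrapped in a loop. A sketch, assuming the category1_id values are known in advance and the hive CLI is on the PATH:
#!/bin/bash
# partition values to sync (illustrative)
for cid in 1 2 3; do
  hive -e "alter table bigdata.base_category2_p2 add if not exists partition(category1_id=${cid})"
  sqoop import \
  --connect jdbc:mysql://hadoop102:3306/gmall \
  --username root \
  --password 000000 \
  --target-dir /user/hive/warehouse/bigdata.db/base_category2_p2/category1_id=${cid} \
  --delete-target-dir \
  -m 1 \
  --query "select * from gmall.base_category2 where category1_id = ${cid} and \$CONDITIONS" \
  --fields-terminated-by "\t"
done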
2. Importing directly into Hive with Sqoop (clunky in practice; a sketch of the flags is shown below)
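For reference, a hedged sketch of the direct-to-Hive import (the options are standard Sqoop flags, but the database and table names here are placeholders):
sqoop import \
--connect jdbc:mysql://hadoop102:3306/gmall \
--username root \
--password 000000 \
--table base_category2 \
--hive-import \
--hive-database bigdata \
--hive-table base_category2 \
--hive-overwrite \
--fields-terminated-by "\t" \
-m 1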
IV. Hive to MySQL
Data export
sqoop export \
--connect jdbc:mysql://bigdata31:3306/bigdata \
--username root \
--password 123456 \
--table result \
--export-dir /user/hive/warehouse/bigdata.db/rpt_cnt_top3 \
--fields-terminated-by "\t"
But how do we guarantee idempotency? (If the export is run a second time, MySQL ends up with a duplicate copy of the data; how do we make sure repeated syncs leave exactly one copy?)
- Use the mysql command-line client (on Linux) to delete the rows already in the target MySQL table
- First check whether the target table has data: if it does, delete it; if not, skip the delete
- Then run the Sqoop export
V. Sqoop synchronization script
Things the script needs to handle:
- Idempotency
- Alerting (when the job fails)
- Scheduling (running the sync job from a scheduling platform)
#!/bin/bash
if [ $# -ne 6 ];then
  echo "usage: <mysql_username> <mysql_password> <hive_db> <hive_tb> <mysql_db> <mysql_tb>"
  echo "mysql_username  MySQL user name"
  echo "mysql_password  MySQL password"
  echo "hive_db         source: Hive database"
  echo "hive_tb         source: Hive table"
  echo "mysql_db        target: MySQL database"
  echo "mysql_tb        target: MySQL table"
  exit 1
fi

mysql_username=$1
mysql_password=$2
hive_db=$3
hive_tb=$4
mysql_db=$5
mysql_tb=$6

hive_dir=/user/hive/warehouse
seq=","

# idempotency: if the target table already has rows, delete them before exporting
cnt=$(mysql -u "${mysql_username}" -p"${mysql_password}" -N -s -e "select count(*) from ${mysql_tb}" "${mysql_db}")
if [ "$cnt" -gt 0 ]; then
  mysql -u "${mysql_username}" -p"${mysql_password}" -e "DELETE FROM ${mysql_tb}" "${mysql_db}"
  echo "deleted existing rows from ${mysql_db}.${mysql_tb}"
fi

sqoop export \
--connect jdbc:mysql://hadoop102:3306/${mysql_db} \
--username ${mysql_username} \
--password ${mysql_password} \
--table ${mysql_tb} \
--export-dir ${hive_dir}/${hive_db}.db/${hive_tb} \
--fields-terminated-by "${seq}"

# alerting: check the sqoop exit status
exit_status=$?
if [ $exit_status -eq 0 ]; then
  echo "sqoop export succeeded"
else
  echo "sqoop export failed ----- send alert"
fi
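A usage example (the script name is hypothetical), plus a cron entry for the scheduling requirement above, assuming a daily 02:00 run:
sh hive2mysql_export.sh root 123456 bigdata rpt_cnt_top3 bigdata result
# crontab -e: run every day at 02:00 (script and log paths are assumptions)
0 2 * * * sh /home/hadoop/project/sqoop/hive2mysql_export.sh root 123456 bigdata rpt_cnt_top3 bigdata result >> /home/hadoop/logs/hive2mysql.log 2>&1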
VI. Performance
Because Sqoop runs MapReduce under the hood and has to request resources from YARN, performance tends to be the bottleneck.
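The main knobs are the degree of parallelism (-m together with --split-by) and the YARN queue the job is submitted to. A hedged sketch, reusing the bigdata.b table from above (the queue name and map count are assumptions):
sqoop import \
-Dmapreduce.job.queuename=root.etl \
--connect jdbc:mysql://bigdata31:3306/bigdata \
--username root \
--password 123456 \
--table b \
--target-dir /data/b \
--delete-target-dir \
--split-by id \
-m 8 \
--fields-terminated-by "\t"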
2. DataX
The JSON template below reads data from HDFS and writes it into MySQL.
A shell script substitutes the placeholders in the template and then runs the resulting job file.
I. JSON template
{
"job": {
"setting": {
"speed": {
"channel": 3
}
},
"content": [
{
"reader": {
"name": "hdfsreader",
"parameter": {
"path": "hdfs_path",
"defaultFS": "hdfs_defaultFS",
"column": [
hdfs_column
],
"fileType": "hdfs_fileType",
"encoding": "UTF-8",
"fieldDelimiter": "hdfs_fieldDelimiter"
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"writeMode": "mysql_writeMode",
"username": "mysql_username",
"password": "mysql_password",
"column": [
mysql_column
],
"preSql": [
"mysql_preSql"
],
"connection": [
{
"jdbcUrl": "mysql_jdbcUrl",
"table": [
"mysql_table"
]
}
]
}
}
}
]
}
}
II. Shell script
#hdfs2mysql
#hdfs
hdfs_path=/user/hive/warehouse/bigdata.db/rpt_cnt_top3/*
hdfs_defaultFS=hdfs://bigdata31:9000
hdfs_column='{"index": 0,"type": "string"},{"index": 1,"type": "string"},{"index": 2,"type": "long"},{"index": 3,"type": "long"}'
hdfs_fileType=text
hdfs_fieldDelimiter="\\\t"
#mysql
mysql_db=bigdata
mysql_table=result
mysql_column='"area","product_name","cnt","rk"'
mysql_jdbcUrl=jdbc:mysql://bigdata31:3306/${mysql_db}
mysql_username=root
mysql_password=123456
mysql_preSql='delete from result'
mysql_writeMode=insert
# path of the JSON job file that DataX will run
json_file="/home/hadoop/project/datax/hdfs_${mysql_table}.json"
cat /home/hadoop/project/datax/hdfs2mysql.json > ${json_file}
## substitute the placeholders in the template
sed -i "s#hdfs_path#${hdfs_path}#g" ${json_file}
sed -i "s#hdfs_defaultFS#${hdfs_defaultFS}#g" ${json_file}
sed -i "s#hdfs_column#${hdfs_column}#g" ${json_file}
sed -i "s#hdfs_fileType#${hdfs_fileType}#g" ${json_file}
sed -i "s#hdfs_fieldDelimiter#${hdfs_fieldDelimiter}#g" ${json_file}
sed -i "s#mysql_jdbcUrl#${mysql_jdbcUrl}#g" ${json_file}
sed -i "s#mysql_username#${mysql_username}#g" ${json_file}
sed -i "s#mysql_password#${mysql_password}#g" ${json_file}
sed -i "s#mysql_table#${mysql_table}#g" ${json_file}
sed -i "s#mysql_column#${mysql_column}#g" ${json_file}
sed -i "s#mysql_preSql#${mysql_preSql}#g" ${json_file}
sed -i "s#mysql_writeMode#${mysql_writeMode}#g" ${json_file}
source ~/.bashrc
datax ${json_file}
The datax command is an alias configured in ~/.bashrc:
alias datax='python /opt/module/DataX-master/target/datax/datax/bin/datax.py'
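As with the Sqoop script, the DataX run can be wired up for alerting. A sketch of how the last step of the script above could be hardened: validate the JSON produced by the sed substitutions, then check the DataX exit status (the alert action is a placeholder echo):
# validate the generated JSON before running it
python -m json.tool ${json_file} > /dev/null || { echo "invalid json: ${json_file}"; exit 1; }
datax ${json_file}
if [ $? -ne 0 ]; then
  echo "datax job failed ----- send alert"
  exit 1
fi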