Data Synchronization (Sqoop & DataX)


Data Synchronization

Use case: synchronizing data between MySQL and Hadoop (Hive/HDFS).

1. Sqoop

I. Installation and Deployment

  1. Download Sqoop 1.4.7 and extract it.
  2. Configure the environment variables.
  3. Configure Sqoop: edit conf/sqoop-env.sh and point it at your Hadoop installation (a sketch follows this list).
  4. The stock 1.4.7 release has a dependency problem, so download the missing jars yourself:
    replace the bundled commons-lang3-3.4.jar with a commons-lang 2.x jar pulled from Maven into lib/, and also add the MySQL Connector jar.
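
As a reference, conf/sqoop-env.sh only needs a few exports; the paths below are assumptions and must point at your own installation.

# conf/sqoop-env.sh -- example values, adjust the paths to your environment
export HADOOP_COMMON_HOME=/opt/module/hadoop   # Hadoop installation directory (assumed path)
export HADOOP_MAPRED_HOME=/opt/module/hadoop   # MapReduce home, usually the same directory
export HIVE_HOME=/opt/module/hive              # needed only when importing into Hive (assumed path)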

II. Parameters

--connect            specify the MySQL JDBC URL
--username           specify the user name
--password           specify the password
--table              specify the table name
--target-dir         specify the target HDFS path
-m                   specify the number of map tasks
--delete-target-dir  delete the target path before importing

sqoop import \
--connect jdbc:mysql://bigdata31:3306/hive \
--username root \
--password 123456 \
--table tbls \
--target-dir /data/tbls \
--delete-target-dir \
-m 1 

Supplement: what if the MySQL table has no primary key?

The import fails with an error:

ERROR tool.ImportTool: Import failed: No primary key could be found for table b. Please specify one with --split-by or perform a sequential import with '-m 1'.

By default Sqoop splits the import into 4 map tasks based on the primary key, so a table without one needs one of two workarounds:

  1. --split-by "column": split on the specified column
  2. -m 1: don't split, run a single map task

--where    filter rows
--columns  select specific columns
--query    specify the SQL used to extract the data

(A sketch using --where and --columns follows the --query example below.)

sqoop import \
--connect jdbc:mysql://bigdata31:3306/bigdata \
--username root \
--password 123456 \
--target-dir /data/b \
--delete-target-dir \
--split-by id \
-m 2 \
--query "select *  from bigdata.b where \$CONDITIONS" \
--fields-terminated-by "\t"
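
For completeness, here is a sketch of --where and --columns against the same keyless table b; the column names and target directory are assumptions, not taken from the original examples. -m 1 keeps it to a single map task since b has no primary key.

sqoop import \
--connect jdbc:mysql://bigdata31:3306/bigdata \
--username root \
--password 123456 \
--table b \
--columns "id,name" \
--where "id > 100" \
--target-dir /data/b_subset \
--delete-target-dir \
-m 1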

III. MySQL to Hive

1. Create the Hive table first, then use Sqoop to load the data into the Hive table's HDFS directory.

Non-partitioned table

  • Create the Hive table

    CREATE TABLE base_category2 ( id bigint, name varchar(200), category1_id bigint ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

  • Import into the table with Sqoop

    sqoop import \
    --connect jdbc:mysql://hadoop102:3306/gmall \
    --username root \
    --password 000000 \
    --target-dir /user/hive/warehouse/bigdata.db/base_category2 \
    --delete-target-dir \
    -m 1 \
    --query "select * from gmall.base_category2 where \$CONDITIONS" \
    --fields-terminated-by "\t"

  • Query it in Hive

    select * from base_category2

Partitioned table

  • Create the Hive table

    CREATE TABLE base_category2_p2 ( id bigint , name varchar(200) ) PARTITIONED BY (category1_id bigint ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

  • Add the partition

    alter table base_category2_p2 add partition(category1_id=1)

  • Import into the partition with Sqoop via a WHERE condition

    sqoop import \
    --connect jdbc:mysql://hadoop102:3306/gmall \
    --username root \
    --password 000000 \
    --target-dir /user/hive/warehouse/bigdata.db/base_category2_p2/category1_id=1 \
    --delete-target-dir \
    -m 1 \
    --query "select * from gmall.base_category2 where category1_id = 1 and \$CONDITIONS" \
    --fields-terminated-by "\t"

  • Query it in Hive

    select * from base_category2_p2

2. Sqoop's built-in Hive import (clunky in practice; a rough sketch follows below)
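
For reference, the built-in route looks roughly like the sketch below; --hive-import, --hive-database and --hive-table are standard Sqoop options, but the database and table names used here are just the ones from the example above.

sqoop import \
--connect jdbc:mysql://hadoop102:3306/gmall \
--username root \
--password 000000 \
--table base_category2 \
--hive-import \
--hive-database bigdata \
--hive-table base_category2 \
--fields-terminated-by "\t" \
-m 1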

IV. Hive to MySQL

Data export

sqoop export \
--connect jdbc:mysql://bigdata31:3306/bigdata \
--username root \
--password 123456 \
--table result \
--export-dir /user/hive/warehouse/bigdata.db/rpt_cnt_top3 \
--fields-terminated-by "\t"

But how do we guarantee idempotency? (If the data has been exported to MySQL and we export again, it ends up duplicated; how do we make sure repeated syncs still leave exactly one copy?)

  1. Use the mysql command-line client (on Linux) to delete the existing data in the target MySQL table
    • first check whether the table already has data: if it does, delete it; if not, skip the delete
  2. Then run the Sqoop export

The script in section V below implements exactly this delete-then-export pattern.

V. Sqoop Sync Script

Things the script needs to handle:

  • idempotency
  • alerting (when the job fails)
  • the sync job itself (driven by a scheduling platform)

#!/bin/bash
if [ $# -ne 6 ];then
    echo "usage: <mysql_username> <mysql_password> <hive_db> <hive_tb> <mysql_db> <mysql_tb>"
    echo "mysql_username    mysql用户名"
    echo "mysql_password    mysql密码"
    echo "hive_db    数据源:hive库"
    echo "hive_tb    数据源:hive表"
    echo "mysql_db   导入到:mysql库"
    echo "mysql_tb   导入到:mysql表"
    exit 1
fi

mysql_username=$1
mysql_password=$2
hive_db=$3
hive_tb=$4
mysql_db=$5
mysql_tb=$6
hive_dir=/user/hive/warehouse
seq=","

# number of rows currently in the target MySQL table
cnt=$(mysql -u "${mysql_username}" -p"${mysql_password}" -N -s -e "select count(*) from ${mysql_tb}" "${mysql_db}")

# idempotency: if the table already holds data, clear it before exporting again
if [ "$cnt" -gt 0 ]; then
    mysql -u "${mysql_username}" -p"${mysql_password}" -e "DELETE FROM ${mysql_tb}" "${mysql_db}"
    echo "cleared existing data in ${mysql_db}.${mysql_tb}"
fi


sqoop export \
--connect jdbc:mysql://hadoop102:3306/${mysql_db} \
--username ${mysql_username} \
--password ${mysql_password} \
--table ${mysql_tb} \
--export-dir ${hive_dir}/${hive_db}.db/${hive_tb} \
--fields-terminated-by "${seq}"

exit_status=$?
if [ $exit_status -eq 0 ]; then
    echo "sqoop data transfer succeeded"
else
    echo "transfer failed ----- send alert"
fi
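
Assuming the script is saved as hive2mysql.sh (the file name is an assumption), an invocation that matches the export example above would be:

# args: mysql_username mysql_password hive_db hive_tb mysql_db mysql_tb
sh hive2mysql.sh root 123456 bigdata rpt_cnt_top3 bigdata result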

VI. Performance

Because Sqoop runs on MapReduce under the hood and has to request its resources from YARN, performance has always been a bottleneck.

2. DataX

The JSON template below reads data from HDFS and writes it into MySQL.
A shell script substitutes the parameters into the JSON template and then hands the generated job file to DataX.

I. JSON Template

{
	"job": {
		"setting": {
			"speed": {
				"channel": 3
			}
		},
		"content": [
			{
				"reader": {
					"name": "hdfsreader",
					"parameter": {
						"path": "hdfs_path",
						"defaultFS": "hdfs_defaultFS",
						"column": [
							hdfs_column
						],
						"fileType": "hdfs_fileType",
						"encoding": "UTF-8",
						"fieldDelimiter": "hdfs_fieldDelimiter"
					}
				},
				"writer": {
					"name": "mysqlwriter",
					"parameter": {
						"writeMode": "mysql_writeMode",
						"username": "mysql_username",
						"password": "mysql_password",
						"column": [
							mysql_column
						],
						"preSql": [
                            "mysql_preSql"
                        ],
						"connection": [
							{
								"jdbcUrl": "mysql_jdbcUrl",
								"table": [
									"mysql_table"
								]
							}
						]
					}
				}
			}
		]
	}
}

II. Shell Script

# hdfs2mysql: fill in the DataX JSON template, then run the job
# HDFS (reader) parameters
hdfs_path=/user/hive/warehouse/bigdata.db/rpt_cnt_top3/*
hdfs_defaultFS=hdfs://bigdata31:9000
hdfs_column='{"index": 0,"type": "string"},{"index": 1,"type": "string"},{"index": 2,"type": "long"},{"index": 3,"type": "long"}'
hdfs_fileType=text
hdfs_fieldDelimiter="\\\t"
# MySQL (writer) parameters
mysql_db=bigdata
mysql_table=result
mysql_column='"area","product_name","cnt","rk"'
mysql_jdbcUrl=jdbc:mysql://bigdata31:3306/${mysql_db}
mysql_username=root
mysql_password=123456
mysql_preSql='delete from result'
mysql_writeMode=insert

# path of the JSON job file that DataX will run
json_file="/home/hadoop/project/datax/hdfs_${mysql_table}.json"
cat /home/hadoop/project/datax/hdfs2mysql.json > ${json_file}

## substitute the parameters into the template copy
sed -i "s#hdfs_path#${hdfs_path}#g" ${json_file}
sed -i "s#hdfs_defaultFS#${hdfs_defaultFS}#g" ${json_file}
sed -i "s#hdfs_column#${hdfs_column}#g" ${json_file}
sed -i "s#hdfs_fileType#${hdfs_fileType}#g" ${json_file}
sed -i "s#hdfs_fieldDelimiter#${hdfs_fieldDelimiter}#g" ${json_file}
sed -i "s#mysql_jdbcUrl#${mysql_jdbcUrl}#g" ${json_file}
sed -i "s#mysql_username#${mysql_username}#g" ${json_file}
sed -i "s#mysql_password#${mysql_password}#g" ${json_file}
sed -i "s#mysql_table#${mysql_table}#g" ${json_file}
sed -i "s#mysql_column#${mysql_column}#g" ${json_file}
sed -i "s#mysql_preSql#${mysql_preSql}#g" ${json_file}
sed -i "s#mysql_writeMode#${mysql_writeMode}#g" ${json_file}

source ~/.bashrc
datax ${json_file}

datax is an alias configured in ~/.bashrc:

alias datax='python /opt/module/DataX-master/target/datax/datax/bin/datax.py'
export datax
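
Note that bash only expands aliases in interactive shells (unless expand_aliases is enabled), and `export datax` does not export an alias, so the alias sourced inside the script may not resolve. A shell function in ~/.bashrc is a safer equivalent; a minimal sketch using the same path as the alias above:

# ~/.bashrc -- function form of the datax launcher; also works inside scripts
datax() {
    python /opt/module/DataX-master/target/datax/datax/bin/datax.py "$@"
}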