Using DataX to Transfer Tables Between Hive and MySQL


1. DataX Overview (excerpted from github.com/alibaba/Dat…)

DataX

DataX is an offline data synchronization tool/platform widely used within Alibaba Group. It implements efficient data synchronization between heterogeneous data sources, including MySQL, Oracle, SqlServer, PostgreSQL, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), and DRDS.

Features

As a data synchronization framework, DataX abstracts the synchronization of different data sources into Reader plugins, which read data from the source, and Writer plugins, which write data to the target; in principle the framework can synchronize data between arbitrary source types. The plugin system also works as an ecosystem: every newly added data source immediately becomes interoperable with all existing ones.
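To make the Reader/Writer pairing concrete, here is a minimal sketch of a job file, modeled on the self-check job (job/job.json) that ships with DataX: it pairs the built-in streamreader, which fabricates rows in memory, with streamwriter, which prints them. The column values and sliceRecordCount below are purely illustrative.

{
    "job": {
        "setting": {
            "speed": {
                "channel": 1
            }
        },
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "column": [
                            {"value": "hello", "type": "string"},
                            {"value": 1, "type": "long"}
                        ],
                        "sliceRecordCount": 10
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "encoding": "UTF-8",
                        "print": true
                    }
                }
            }
        ]
    }
}

Every job in this article follows this same shape; only the reader/writer names and their parameter blocks change.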

DataX in Detail

See: DataX-Introduction

Quick Start

Download DataX: download address

See: Quick Start

Support Data Channels

DataX already has a fairly complete plugin ecosystem: mainstream RDBMS databases, NoSQL stores, and big-data compute systems are all supported. The currently supported data channels are listed below; for details, see the DataX Data Source Reference Guide.

Type                          Data sources
RDBMS (relational databases)  MySQL, Oracle, SQLServer, PostgreSQL, DRDS, generic RDBMS (supports all relational databases)
Alibaba Cloud data warehouse  ODPS, ADS, OSS, OCS
NoSQL stores                  OTS, Hbase0.94, Hbase1.1, MongoDB, Hive
Unstructured storage          TxtFile, FTP, HDFS, Elasticsearch

(Each source has Reader (read) and/or Writer (write) plugins; the per-source documentation is linked from the reference guide above.)

2.1 Importing a Hive Table into MySQL

1. Download DataX and upload it to the server.
2. Extract it, then run the bundled self-check job:
python {YOUR_DATAX_HOME}/bin/datax.py {YOUR_DATAX_HOME}/job/job.json
3. Once the self-check passes, create a your_job.json configuration file from the template to match your needs; hive2mysql is used as the example here.
(Note that reading a Hive table really just means reading the files Hive stores on HDFS.)
{
    "job": {
        "setting": {
            "speed": {
                "channel": 3
            }
        },
        "content": [
            {
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "path": "/apps/hive/warehouse/test_table/*",
                        "defaultFS": "hdfs://192.168.10.101:8020",
                        "column": [
                               {
                                "index": 0,
                                "type": "string"
                               },
                               {
                                "index": 1,
                                "type": "long"
                               }
                        ],
                        "fileType": "text",
                        "encoding": "UTF-8",
                        "fieldDelimiter": "\u0001"
                    }

                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "writeMode": "insert",
                        "username": "username",
                        "password": "password",
                        "column": [
                            "word",
                            "cnt"
                        ],
                        "session": [
                        	"set session sql_mode='ANSI'"
                        ],
                        "preSql": [
                            "delete from test_table"
                        ],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://192.168.10.116:3306/test_datax?useUnicode=true&characterEncoding=gbk",
                                "table": [
                                    "test_table"
                                ]
                            }
                        ]
                    }
                }
            }
        ]
    }
}
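A note on the reader block above: per the hdfsreader documentation, if every field can be read as a string, the per-index column list can be replaced with a wildcard. A sketch of that variant (same path, defaultFS, and delimiter as above):

"reader": {
    "name": "hdfsreader",
    "parameter": {
        "path": "/apps/hive/warehouse/test_table/*",
        "defaultFS": "hdfs://192.168.10.101:8020",
        "column": ["*"],
        "fileType": "text",
        "encoding": "UTF-8",
        "fieldDelimiter": "\u0001"
    }
}

The explicit index/type form used in the job above is still preferable when the target columns are not all strings, since it lets DataX convert types per column.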
4. Run the job:
[root@Master datax]# python ./bin/datax.py ./job/hive2mysql.json 

DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.


2018-08-29 23:19:46.674 [main] INFO  VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2018-08-29 23:19:46.691 [main] INFO  Engine - the machine info  => 

	osInfo:	Oracle Corporation 1.8 25.161-b12
	jvmInfo:	Linux amd64 2.6.32-696.23.1.el6.x86_64
	cpu num:	1

	totalPhysicalMemory:	-0.00G
	freePhysicalMemory:	-0.00G
	maxFileDescriptorCount:	-1
	currentOpenFileDescriptorCount:	-1

	GC Names	[Copy, MarkSweepCompact]

	MEMORY_NAME                    | allocation_size                | init_size                      
	Eden Space                     | 273.06MB                       | 273.06MB                       
	Code Cache                     | 240.00MB                       | 2.44MB                         
	Survivor Space                 | 34.13MB                        | 34.13MB                        
	Compressed Class Space         | 1,024.00MB                     | 0.00MB                         
	Metaspace                      | -0.00MB                        | 0.00MB                         
	Tenured Gen                    | 682.69MB                       | 682.69MB                       


2018-08-29 23:19:46.732 [main] INFO  Engine - 
...
2018-08-29 23:20:00.489 [job-0] INFO  JobContainer - 
	 [total cpu info] => 
		averageCpu                     | maxDeltaCpu                    | minDeltaCpu                    
		-1.00%                         | -1.00%                         | -1.00%
                        

	 [total gc info] => 
		 NAME                 | totalGCCount       | maxDeltaGCCount    | minDeltaGCCount    | totalGCTime        | maxDeltaGCTime     | minDeltaGCTime     
		 Copy                 | 0                  | 0                  | 0                  | 0.000s             | 0.000s             | 0.000s             
		 MarkSweepCompact     | 1                  | 1                  | 1                  | 0.071s             | 0.071s             | 0.071s             

2018-08-29 23:20:00.490 [job-0] INFO  JobContainer - PerfTrace not enable!
2018-08-29 23:20:00.499 [job-0] INFO  StandAloneJobContainerCommunicator - Total 939 records, 31267 bytes | Speed 3.05KB/s, 93 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.001s |  All Task WaitReaderTime 0.285s | Percentage 100.00%
2018-08-29 23:20:00.500 [job-0] INFO  JobContainer - 
Job start time                 : 2018-08-29 23:19:46
Job end time                   : 2018-08-29 23:20:00
Total elapsed time             :                 13s
Average throughput             :            3.05KB/s
Record write speed             :             93rec/s
Total records read             :                 939
Total read/write failures      :                   0
Important: the data Hive stores on HDFS is delimited by '\001' by default. If you open such a file in vim the separator shows up as ^A, but in the job file the delimiter must be written as "fieldDelimiter": "\u0001"!
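In other words, the only working spelling in the JSON job file is the \u escape for the single byte 0x01. Writing the delimiter out some other way (e.g. as the literal text \001) makes DataX see a multi-character delimiter and abort with the UnstructuredStorageReader-03 error quoted in the errors section at the end of this article.

Correct:
    "fieldDelimiter": "\u0001"
Incorrect (a common mistake; DataX then reports a multi-character delimiter):
    "fieldDelimiter": "\001"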

2.2 Importing a MySQL Table into Hive

1. Create the table in Hive (stored as a text file):
hive> create table mysql_table(word string, cnt int) row format delimited fields terminated by ',' STORED AS TEXTFILE;
OK
Time taken: 0.194 seconds
hive> select * from mysql_table limit 10;
OK
Time taken: 0.162 seconds
2. Write the mysql2hive.json configuration file:
{
    "job": {
        "setting": {
            "speed": {
                 "channel": 3
            },
            "errorLimit": {
                "record": 0,
                "percentage": 0.02
            }
        },
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "writeMode": "insert",
                        "username": "root",
                        "password": "123456",
                        "column": [
                            "word",
                            "cnt"
                        ],
                        "splitPk": "cnt",
                        "connection": [
                            {
                                "table": [
                                    "test_table"
                                ],
                                "jdbcUrl": [
     "jdbc:mysql://192.168.10.116:3306/test_datax"
                                ]
                            }
                        ]
                    }
                },
               "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "defaultFS": "hdfs://192.168.10.101:8020",
                        "fileType": "text",
                        "path": "/apps/hive/warehouse/mysql_table",
                        "fileName": "mysql_table",
                        "column": [
                            {
                                "name": "word",
                                "type": "string"
                            },
                            {
                                "name": "cnt",
                                "type": "int"
                            }
                        ],
                        "writeMode": "append",
                        "fieldDelimiter": ",",
                        "compress":"gzip"
                    }
                }
            }
        ]
    }
}
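A variant worth knowing from the mysqlreader documentation: instead of table + column + splitPk, the reader can take a free-form SQL query via querySql inside connection (when querySql is present, table/column/splitPk are ignored). A sketch of the reader above rewritten this way; the WHERE clause is purely illustrative:

"reader": {
    "name": "mysqlreader",
    "parameter": {
        "username": "root",
        "password": "123456",
        "connection": [
            {
                "querySql": [
                    "select word, cnt from test_table where cnt > 1"
                ],
                "jdbcUrl": [
                    "jdbc:mysql://192.168.10.116:3306/test_datax"
                ]
            }
        ]
    }
}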
3. Run the job:
[root@Master datax]# python ./bin/datax.py ./job/mysql2hive.json 

DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.


2018-08-30 01:03:36.399 [main] INFO  VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2018-08-30 01:03:36.415 [main] INFO  Engine - the machine info  => 

	osInfo:	Oracle Corporation 1.8 25.161-b12
	jvmInfo:	Linux amd64 2.6.32-696.23.1.el6.x86_64
	cpu num:	1

	totalPhysicalMemory:	-0.00G
	freePhysicalMemory:	-0.00G
	maxFileDescriptorCount:	-1
	currentOpenFileDescriptorCount:	-1

	GC Names	[Copy, MarkSweepCompact]

	MEMORY_NAME                    | allocation_size                | init_size                      
	Eden Space                     | 273.06MB                       | 273.06MB                       
	Code Cache                     | 240.00MB                       | 2.44MB                         
	Survivor Space                 | 34.13MB                        | 34.13MB                        
	Compressed Class Space         | 1,024.00MB                     | 0.00MB                         
	Metaspace                      | -0.00MB                        | 0.00MB                         
	Tenured Gen                    | 682.69MB                       | 682.69MB                       


2018-08-30 01:03:36.469 [main] INFO  Engine - 
...
2018-08-30 01:03:52.094 [job-0] INFO  JobContainer - 
	 [total cpu info] => 
		averageCpu                     | maxDeltaCpu                    | minDeltaCpu                    
		-1.00%                         | -1.00%                         | -1.00%
                        

	 [total gc info] => 
		 NAME                 | totalGCCount       | maxDeltaGCCount    | minDeltaGCCount    | totalGCTime        | maxDeltaGCTime     | minDeltaGCTime     
		 Copy                 | 1                  | 1                  | 1                  | 0.346s             | 0.346s             | 0.346s             
		 MarkSweepCompact     | 1                  | 1                  | 1                  | 0.206s             | 0.206s             | 0.206s             

2018-08-30 01:03:52.094 [job-0] INFO  JobContainer - PerfTrace not enable!
2018-08-30 01:03:52.096 [job-0] INFO  StandAloneJobContainerCommunicator - Total 939 records, 19462 bytes | Speed 1.90KB/s, 93 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.975s |  All Task WaitReaderTime 0.022s | Percentage 100.00%
2018-08-30 01:03:52.101 [job-0] INFO  JobContainer - 
Job start time                 : 2018-08-30 01:03:36
Job end time                   : 2018-08-30 01:03:52
Total elapsed time             :                 15s
Average throughput             :            1.90KB/s
Record write speed             :             93rec/s
Total records read             :                 939
Total read/write failures      :                   0
4. Check that the Hive table now contains the data:
hive> select * from mysql_table limit 10;
OK
at-spi-python-1.28.1	1
mesa-dri-drivers-9.2	1
m4-1.4.13-5.el6.x86_	1
libXxf86vm-1.1.2-2.e	1
libuuid-2.17.2-12.14	1
httpd-2.2.15-29.el6.	1
libgweather-2.28.0-5	1
lvm2-2.02.100-8.el6.	1
xmlrpc-c-1.16.24-121	1
parted-2.1-21.el6.x8	1
Time taken: 0.148 seconds, Fetched: 10 row(s)
hive> select * from mysql_table order by cnt desc limit 10;
Query ID = hdfs_20180830010430_68021b2b-9bd8-44a0-92df-dbe0c717053a
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
Session re-established.


Status: Running (Executing on YARN cluster with App id application_1535363634130_0005)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 17.63 s    
--------------------------------------------------------------------------------
OK
Installing	918
warning:	2
***	2
tcp_wrappers-libs-7.	1
m4-1.4.13-5.el6.x86_	1
httpd-2.2.15-29.el6.	1
libgweather-2.28.0-5	1
libXxf86vm-1.1.2-2.e	1
xmlrpc-c-1.16.24-121	1
parted-2.1-21.el6.x8	1
Time taken: 33.713 seconds, Fetched: 10 row(s)

1. Errors You May Encounter

2018-08-29 22:37:54.327 [job-0] ERROR JobContainer - Exception when job run com.alibaba.datax.common.
exception.DataXException: Code:[DBUtilErrorCode-01], Description:[Failed to obtain table column information.]. -
Failed to get column metadata for table test_table. Please have a DBA check the database and table. -
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Unknown column 'id' in 'field list'

Cause: a column name listed in "column": [ ] does not match the columns in the MySQL table.

DataX intelligent analysis: the most likely cause of this job failure is: com.alibaba.datax.common.exception.DataXException:
Code:[UnstructuredStorageReader-03], Description:[The parameter value you supplied is invalid.]. - Only single-character
delimiters are supported; the delimiter you configured is: ['01']

Cause: a field-delimiter problem; see the important note at the end of section 2.1.

2. MysqlWriter Type Conversion for MySQL

DataX internal type    MySQL data type
Long                   int, tinyint, smallint, mediumint, bigint, year
Double                 float, double, decimal
String                 varchar, char, tinytext, text, mediumtext, longtext
Date                   date, datetime, timestamp, time
Boolean                bit, bool
Bytes                  tinyblob, mediumblob, blob, longblob, varbinary

  • The conversion for the bit type is currently undefined.

3. MysqlReader Type Conversion for MySQL

DataX internal type    MySQL data type
Long                   int, tinyint, smallint, mediumint, bigint
Double                 float, double, decimal
String                 varchar, char, tinytext, text, mediumtext, longtext, year
Date                   date, datetime, timestamp, time
Boolean                bit, bool
Bytes                  tinyblob, mediumblob, blob, longblob, varbinary

Please note:

  • Other than the field types listed above, no types are supported.
  • DataX treats tinyint(1) as an integer.
  • DataX treats year as a string.
  • bit is undefined behavior in DataX.
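Since bit is undefined behavior for MysqlReader, a practical workaround (my suggestion, not from the original article) is to cast such columns to a supported type inside the querySql form shown after section 2.2. A sketch, assuming a hypothetical table flags with a bit column is_ok:

"connection": [
    {
        "querySql": [
            "select word, cast(is_ok as unsigned) as is_ok from flags"
        ],
        "jdbcUrl": [
            "jdbc:mysql://192.168.10.116:3306/test_datax"
        ]
    }
]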