使用SQOOP全(增)量数据到Hive表

1,842 阅读5分钟

前言

Sqoop’s import tool’s main function is to upload your data into files in HDFS. If you have a Hive metastore associated with your HDFS cluster, Sqoop can also import the data into Hive by generating and executing a CREATE TABLE statement to define the data’s layout in Hive. Importing data into Hive is as simple as adding the --hive-import option to your Sqoop command line.

Sqoop的导入工具的主要功能是将您的数据上传到HDFS中的文件中。如果您拥有与HDFS群集关联的Hive元存储,Sqoop还可以通过生成并执行CREATE TABLE语句在Hive中定义数据的布局,将数据导入Hive。将数据导入Hive在Sqoop命令行中添加--hive-import选项。查询导入语句时,不必翻看官方文档,只需在shell中执行sqoop help命令即可。

如果你对SQOOP尚未了解,请移步至Apache SQOOP这篇文章。
导入过程中出现的Error你也可以在文章中寻找到解决方式

表结构与数据源

  • Mysql表结构
create table mysql_student(
id int,
name varchar(10),
age int,
sex varchar(5),
create_date DATE,
primary key (id)
);
  • Mysql插入数据
insert into mysql_student values
(1,'xinxing',22,'boy','2020-06-12'),
(2,'xiaohong',26,'girl','2020-06-17'),
(3,'xiaoming',22,'boy','2020-06-13'),
(4,'shishi',17,'girl','2020-05-12'),
(5,'xiaoli',30,'boy','2020-07-12');
  • Mysql更新数据源
insert into mysql_student values
(6,'liudehua',22,'boy','2020-06-12'),
(7,'yangmi',26,'girl','2020-06-17'),
(8,'yangyang',22,'boy','2020-06-13'),
(9,'hejiong',17,'girl','2020-05-12'),
(10,'xiena',30,'boy','2020-07-12');
  • Hive表结构
create table hive_student(
id int,
name string,
age int,
sex string,
create_date string
)
comment '全量导入学生表'
row format delimited fields terminated by ',';

抽取MYSQL数据到(HDFS)Hive单分区中

全量抽取

sqoop import \
--connect jdbc:mysql://xinxingdata001:3306/mysqlXinxing \
--username root \
--password 123456 \
--hive-database xinxing \
--table mysql_student \
--hive-table hive_student \
--hive-import \
--delete-target-dir \
--fields-terminated-by ',' \
--mapreduce-job-name Xinxing00001Job;

查询hive_student

hive (xinxing)> select * from hive_student;
OK
hive_student.id	hive_student.name	hive_student.age	hive_student.sex	hive_student.create_date
1               xinxing	                      22                  boy	                2020-06-12
2               xiaohong                      26                  girl                  2020-06-17
3               xiaoming                      22                  boy	                2020-06-13
4               shishi	                      17                  girl                  2020-05-12
5               xiaoli	                      30                  boy	                2020-07-12
Time taken: 0.04 seconds, Fetched: 5 row(s)

增量抽取

–check-column用来指定一些列,这些列在导入时用来检查做决定数据是否要被作为增量数据。注意:字符类型不能作为增量标识字段
–incremental用来指定增量导入的模式(Mode),append和lastmodified
–last-value指定上一次导入中检查列指定字段最大值

sqoop import \
--connect jdbc:mysql://xinxingdata001:3306/mysqlXinxing \
--username root \
--password 123456 \
--hive-database xinxing \
--table mysql_student \
--hive-table hive_student \
--hive-import \
--fields-terminated-by ',' \
--check-column id \
--incremental append \ 
--last-value 5 \
--mapreduce-job-name Xinxing00001Job \

查询hive_student

hive (xinxing)> select * from hive_student;
OK
hive_student.id	hive_student.name	hive_student.age	hive_student.sex	hive_student.create_date
1	                xinxing	                22              	boy	                2020-06-12
6	                liudehua                22              	boy	                2020-06-12
2	                xiaohong                26              	girl                	2020-06-17
7	                yangmi	                26              	girl                	2020-06-17
3	                xiaoming                22              	boy	                2020-06-13
8	                yangyang                22              	boy	                2020-06-13
4	                shishi	                17              	girl                	2020-05-12
9	                hejiong	                17              	girl                	2020-05-12
5	                xiaoli	                30              	boy	                2020-07-12
10	                xiena	                30              	boy	                2020-07-12
Time taken: 0.031 seconds, Fetched: 10 row(s)

增量抽取十有问题所在的,如果第二条数据xiaohong同学做了变性手术,性别更改为了boy,增量是无法实现数据更新的。后面会引伸出拉链表来解决此类问题。

抽取MYSQL数据到(HDFS)Hive多分区中

全量抽取

创建hive分区表结构

create table hive_Partition_student(
id int,
name string,
age int,
sex string,
create_date string
)
comment '全量导入学生表'
partitioned by (day string)
row format delimited fields terminated by ',';

抽取

sqoop import \
--connect jdbc:mysql://xinxingdata001:3306/mysqlXinxing \
--username root \
--password 123456 \
--hive-database xinxing \
--table mysql_student \
--hive-table hive_Partition_student \
--hive-import \
--delete-target-dir \
--fields-terminated-by ',' \
--hive-partition-key day \
--hive-partition-value 12 \
--mapreduce-job-name Xinxing00001Job;

查询hive_Partition_student

hive (xinxing)> select * from hive_Partition_student where day=12;
OK
hive_partition_student.id	hive_partition_student.name	hive_partition_student.age	hive_partition_student.sex	hive_partition_student.create_date   hive_partition_student.day
1	xinxing	22	boy	2020-06-12	12
2	xiaohong	26	girl	2020-06-17	12
3	xiaoming	22	boy	2020-06-13	12
4	shishi	17	girl	2020-05-12	12

增量抽取

sqoop import \
--connect jdbc:mysql://xinxingdata001:3306/mysqlXinxing \
--username root \
--password 123456 \
--hive-import \
--hive-database xinxing \
--table mysql_student \
--hive-table hive_Partition_student \
--fields-terminated-by ',' \
--check-column id \
--incremental append \
--last-value 4 \
--hive-partition-key day \
--hive-partition-value 13 \
--mapreduce-job-name Xinxing00001Job;

查询hive_Partition_student

hive (xinxing)> select * from hive_partition_student;
OK
hive_partition_student.id	hive_partition_student.name	hive_partition_student.age	hive_partition_student.sex	hive_partition_student.create_date   hive_partition_student.day
1	xinxing	22	boy	2020-06-12	12
2	xiaohong	26	girl	2020-06-17	12
3	xiaoming	22	boy	2020-06-13	12
4	shishi	17	girl	2020-05-12	12
6	liudehua	22	boy	2020-06-12	13
7	yangmi	26	girl	2020-06-17	13
8	yangyang	22	boy	2020-06-13	13
9	hejiong	17	girl	2020-05-12	13
10	xiena	30	boy	2020-07-12	13
Time taken: 0.03 seconds, Fetched: 9 row(s)

Hive增量更新数据

创建Hive表结构

create table hive_student(
id int,
name string,
age int,
sex string,
create_date string
)
comment '全量导入学生表'
row format delimited fields terminated by ',';

抽取全量数据

sqoop import \
--connect jdbc:mysql://xinxingdata001:3306/mysqlXinxing \
--username root \
--password 123456 \
--hive-database xinxing \
--table mysql_student \
--hive-table hive_student \
--hive-import \
--delete-target-dir \
--fields-terminated-by ',' \
--mapreduce-job-name Xinxing00001Job;

mysql插入新数据并update之前存在的数据字段值

把更新之后的Mysql_student导入到Hive_ods层中

sqoop import \
--connect jdbc:mysql://xinxingdata001:3306/mysqlXinxing \
--username root \
--password 123456 \
--hive-database xinxing \
--table mysql_student \
--hive-table hive_ods_student \
--hive-import \
--delete-target-dir \
--fields-terminated-by ',' \
--mapreduce-job-name Xinxing00001Job;

将两个表进行left join,将hive_ods_student表中发生修改的数据更新到hive_ods_student表中

insert overwrite table hive_student
select
  a.*
from
(
   select * from hive_student
) as a
left join
(
   select * from hive_ods_student
) as b
on a.id = b.id
where b.id is null
union all
select
  *
from hive_ods_student