Apache SQOOP


Introduction

Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases. Sqoop jobs are ultimately submitted to YARN for execution (map-only, no reduce phase).

Typical Production Data Warehouse Workflow

Business data lives in MySQL and subsequently needs to be processed on a big data platform (a minimal end-to-end sketch follows the list):

1) Import from MySQL into Hadoop (Hive/HBase, ...)

2) Process it with a distributed big data engine: Hive/Spark/Flink, ...

3) After processing, the result data is still stored in Hadoop

4) Export the analysis results to an RDBMS (MySQL)

5) A UI reads the data from the RDBMS for presentation
	Spring Boot + Spring Data + React
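
A minimal sketch of that round trip; every host, database, table, and column name below is a placeholder rather than part of a real deployment:

# 1) Import the source table from MySQL into Hive
sqoop import \
--connect jdbc:mysql://mysql_host:3306/business_db \
--username root \
--password '***' \
--table orders \
--hive-import \
--hive-table ods_orders

# 2)-3) Process with a distributed engine; the result stays in Hadoop
hive -e "INSERT OVERWRITE TABLE ads_order_stats SELECT order_date, COUNT(*) FROM ods_orders GROUP BY order_date"

# 4) Export the result table back to MySQL for the UI layer to query
sqoop export \
--connect jdbc:mysql://mysql_host:3306/business_db \
--username root \
--password '***' \
--table order_stats \
--export-dir /user/hive/warehouse/ads_order_stats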
	

SQOOP

Deployment

Extract and create a symlink

[hadoop@xinxingdata001 software]$ tar -zxvf sqoop-1.4.6-cdh5.16.2.tar.gz -C ../app
[hadoop@xinxingdata001 app]$ ln -s sqoop-1.4.6-cdh5.16.2/ sqoop

Configure environment variables

[hadoop@xinxingdata001 app]$ vim ~/.bashrc

#set sqoop environment
export SQOOP_HOME=/home/hadoop/app/sqoop
export PATH=$SQOOP_HOME/bin:$PATH

[hadoop@xinxingdata001 app]$ source ~/.bashrc
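
A quick way to confirm the new PATH entry took effect (plain shell checks, nothing Sqoop-specific):

[hadoop@xinxingdata001 app]$ which sqoop
[hadoop@xinxingdata001 app]$ echo $SQOOP_HOME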

Configure sqoop-env.sh

[hadoop@xinxingdata001 ~]$ cd /home/hadoop/app/sqoop/conf/
[hadoop@xinxingdata001 conf]$ cp sqoop-env-template.sh  sqoop-env.sh
[hadoop@xinxingdata001 conf]$ vim sqoop-env.sh
#Set path to where bin/hadoop is available			// Hadoop directory
export HADOOP_COMMON_HOME=/home/hadoop/app/hadoop

#Set path to where hadoop-*-core.jar is available	// Hadoop directory
export HADOOP_MAPRED_HOME=/home/hadoop/app/hadoop

#Set the path to where bin/hive is available		// Hive directory
export HIVE_HOME=/home/hadoop/app/hive

Copy the JDBC driver into sqoop/lib

[hadoop@xinxingdata001 conf]$ cd ~/software/
[hadoop@xinxingdata001 software]$ cp mysql-connector-java-5.1.47.jar ../app/sqoop/lib/mysql-connector-java.jar

Usage

The sqoop help command

[hadoop@xinxingdata001 software]$ sqoop help

usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  import-mainframe   Import datasets from a mainframe server to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information

Check the version number
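
The version command listed in the help output above can be run directly:

[hadoop@xinxingdata001 software]$ sqoop version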

See how a command listed by sqoop help is used

[hadoop@xinxingdata001 software]$ sqoop help list-databases

usage: sqoop list-databases [GENERIC-ARGS] [TOOL-ARGS]

Common arguments:
   --connect <jdbc-uri>                                        Specify JDBC connect string

   --driver <class-name>                                       Manually specify JDBC driver class to use

   --password <password>                                       Set authentication password

   --username <username>                                       Set authentication username
[hadoop@xinxingdata001 software]$ sqoop list-databases  \
								--connect jdbc:mysql://192.168.66.66:3306 \
								--password xinxingdata \
								--username root


[hadoop@xinxingdata001 software]$ sqoop help import

usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]

Import control arguments:
   --append                                                   Imports data in append mode
   --as-textfile                                              Imports data as plain text  (default)
   --columns <col,col,col...>                                 Columns to import from table
   --compression-codec <codec>                                Compression codec to use for import
   --delete-target-dir                                        Imports data in delete mode
   -e,--query <statement>                                      Import results of SQL 'statement'
   -m,--num-mappers <n>                                        Use 'n' map tasks to import in parallel
   --split-by <column-name>                                    Column of the table used to split work units
   --mapreduce-job-name <name>                                Set name for generated mapreduce job                     
   --table <table-name>                                       Table to read
   --target-dir <dir>                                         HDFS plain table destination
   --warehouse-dir <dir>                                      HDFS parent for table destination
   --where <where clause>                                     WHERE clause to use during import
   
Output line formatting arguments:
   --enclosed-by <char>               Sets a required field enclosing character
   --escaped-by <char>                Sets the escape character
   --fields-terminated-by <char>      Sets the field separator character
   --lines-terminated-by <char>       Sets the end-of-line character

Hive arguments:
   --create-hive-table                         Fail if the target hive table exists
   --hive-database <database-name>             Sets the database name to use when importing to hive
   --hive-import                               Import tables into Hive (Uses Hive's default delimiters if none are set.)
   --hive-overwrite                            Overwrite existing data in the Hive table
   --hive-partition-key <partition-key>        Sets the partition key to use when importing to hive
   --hive-partition-value <partition-value>    Sets the partition value to use when importing to hive
   --hive-table <table-name>                   Sets the table name to use when importing to hive
   --map-column-hive <arg>                     Override mapping for specific column to hive types.
   
Code generation arguments:
   --null-non-string <null-str>               Null non-string representation
   --null-string <null-str>                   Null string representation

Importing a MySQL table into HDFS from the command line

sqoop import \
--connect jdbc:mysql://192.168.66.66:3306/xinxingdata \
--password ********** \
--username root \
--table emp

If an error occurs here, download java-json.jar.
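
A minimal fix, assuming the jar was downloaded into ~/software (the path is an assumption):

[hadoop@xinxingdata001 ~]$ cp ~/software/java-json.jar ~/app/sqoop/lib/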

View the imported emp table

[hadoop@xinxingdata001 ~]$ hdfs dfs -text /user/hadoop/emp/part*

Importing MySQL data into HDFS with a query

sqoop import \
--connect jdbc:mysql://192.168.66.66:3306/xinxingdata \
--password xinxingdata \
--username root \
--mapreduce-job-name xinxing \
--target-dir emp \
--delete-target-dir \
--fields-terminated-by '\t' \
--null-non-string '0' \
--null-string ' ' \
--query 'select * from emp where sal >2000 and $CONDITIONS' \
-m 1
[hadoop@xinxingdata001 ~]$ hdfs dfs -text /user/hadoop/emp/p*
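
With -m 1 no split column is needed, but Sqoop requires --split-by when a free-form query is imported with more than one mapper. A sketch of that variant, assuming emp has a numeric empno primary key:

sqoop import \
--connect jdbc:mysql://192.168.66.66:3306/xinxingdata \
--password xinxingdata \
--username root \
--target-dir emp \
--delete-target-dir \
--fields-terminated-by '\t' \
--query 'select * from emp where sal > 2000 and $CONDITIONS' \
--split-by empno \
-m 2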

Passing Sqoop arguments via an options file

[hadoop@xinxingdata001 shell]$ vim sqoop_file
import
--connect
jdbc:mysql://192.168.66.66:3306/xinxingdata
--password
xinxingdata
--username
root
--target-dir
emp
--delete-target-dir
--fields-terminated-by
'\t'
--query
'select * from emp where sal >2000 and $CONDITIONS'
-m
1
[hadoop@xinxingdata001 shell]$ sqoop --options-file sqoop_file 

Exporting from Hadoop to MySQL with Sqoop

[hadoop@xinxingdata001 ~]$ sqoop help export
usage: sqoop export [GENERIC-ARGS] [TOOL-ARGS]

Export control arguments:
...
  --export-dir <dir>                                         HDFS source  path for the export
...
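
Note that sqoop export does not create the target table; it must already exist in MySQL. A minimal way to create it, assuming emp_demo should mirror the emp table used earlier:

[hadoop@xinxingdata001 ~]$ mysql -uroot -p -e 'CREATE TABLE xinxingdata.emp_demo LIKE xinxingdata.emp;'
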
sqoop export \
--connect jdbc:mysql://localhost:3306/xinxingdata \
--password xinxingdata \
--username root \
--table emp_demo \
--mapreduce-job-name FromMySQL2HDFS \
--fields-terminated-by '\t' \
--export-dir /user/hadoop/emp \
-m 2

Check in MySQL whether the export succeeded
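
A quick check from the shell, assuming the mysql client is available on this host:

[hadoop@xinxingdata001 ~]$ mysql -uroot -p -e 'SELECT COUNT(*) FROM xinxingdata.emp_demo;'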

Sqoop Hive Import/Export

usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]

Hive arguments:
   --create-hive-table                         Fail if the target hive  table exists
   --hive-database <database-name>             Sets the database name to use when importing to hive
   --hive-drop-import-delims                   Drop Hive record \0x01 and  row delimiters (\n\r) from imported string fields
   --hive-import                               Import tables into Hive (Uses Hive's default delimiters if none are set.)
   --hive-overwrite                            Overwrite existing data in the Hive table
   --hive-partition-key <partition-key>        Sets the partition key to use when importing to hive
   --hive-partition-value <partition-value>    Sets the partition value to use when importing to hive
   --hive-table <table-name>                   Sets the table name to use when importing to hive
   --map-column-hive <arg>                     Override mapping for specific column to hive types.

Importing into a Hive table

Using the --create-hive-table flag here (shown in the example below) is not recommended, because the automatically generated column types can be wrong; it is better to create the table structure in Hive first and then import the data.
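
A sketch of the recommended approach, creating the Hive table up front (the column list is an assumption based on the classic emp schema; adjust it to the real table):

hive -e "CREATE TABLE IF NOT EXISTS hive_emp (
  empno INT, ename STRING, job STRING, mgr INT,
  hiredate STRING, sal DOUBLE, comm DOUBLE, deptno INT
)"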

sqoop import \
--connect jdbc:mysql://localhost:3306/xinxingdata \
--password xinxingdata \
--username root \
--table emp \
--create-hive-table \
--hive-table hive_emp \
--hive-import \
--delete-target-dir \
--mapreduce-job-name Xinxing2HDFS \
-m 2

The error "Make sure HIVE_CONF_DIR is set correctly." is caused by a missing Hive library jar in Sqoop's lib directory; copying it over fixes it:
[hadoop@xinxingdata001 ~]$ cp ~/app/hive/lib/hive-exec-1.1.0-cdh5.16.2.jar ~/app/sqoop/lib/

Exporting from Hive to a MySQL table

sqoop export \
--connect jdbc:mysql://localhost:3306/xinxingdata \
--password xinxingdata \
--username root \
--export-dir /user/hive/warehouse/hive_emp \
--table emp_hive \
--fields-terminated-by '\t' \
--mapreduce-job-name Xinxing2HDFS \
-m 2