Introduction
Apache Sqoop is a tool designed to efficiently transfer bulk data between Apache Hadoop and structured data stores such as relational databases. Under the hood, a Sqoop job is submitted to YARN as a map-only MapReduce job (no reduce phase).
Typical production data warehouse workflow
Business data lives in MySQL and needs to be processed on the big data platform:
1) Import from MySQL into Hadoop (Hive/HBase...)
2) Process it with a distributed engine: Hive/Spark/Flink...
3) The processed results are still stored in Hadoop
4) Export the analysis results to an RDBMS (MySQL)
5) A UI reads the data from the RDBMS for presentation
Spring Boot + Spring Data + React
SQOOP
Deployment
Extract the tarball and create a symlink
[hadoop@xinxingdata001 software]$ tar -zxvf sqoop-1.4.6-cdh5.16.2.tar.gz -C ../app
[hadoop@xinxingdata001 app]$ ln -s sqoop-1.4.6-cdh5.16.2/ sqoop
Configure environment variables
[hadoop@xinxingdata001 app]$ vim ~/.bashrc
#set sqoop environment
export SQOOP_HOME=/home/hadoop/app/sqoop
export PATH=$SQOOP_HOME/bin:$PATH
[hadoop@xinxingdata001 app]$ source ~/.bashrc
Configure sqoop-env.sh
[hadoop@xinxingdata001 ~]$ cd /home/hadoop/app/sqoop/conf/
[hadoop@xinxingdata001 conf]$ cp sqoop-env-template.sh sqoop-env.sh
[hadoop@xinxingdata001 conf]$ vim sqoop-env.sh
#Set path to where bin/hadoop is available (Hadoop home)
export HADOOP_COMMON_HOME=/home/hadoop/app/hadoop
#Set path to where hadoop-*-core.jar is available (Hadoop home)
export HADOOP_MAPRED_HOME=/home/hadoop/app/hadoop
#Set the path to where bin/hive is available (Hive home)
export HIVE_HOME=/home/hadoop/app/hive
Copy the JDBC driver into the sqoop/lib directory
[hadoop@xinxingdata001 conf]$ cd ~/software/
[hadoop@xinxingdata001 software]$ cp mysql-connector-java-5.1.47.jar ../app/sqoop/lib/mysql-connector-java.jar
Usage
The sqoop help command
[hadoop@xinxingdata001 software]$ sqoop help
usage: sqoop COMMAND [ARGS]
Available commands:
codegen Generate code to interact with database records
create-hive-table Import a table definition into Hive
eval Evaluate a SQL statement and display the results
export Export an HDFS directory to a database table
help List available commands
import Import a table from a database to HDFS
import-all-tables Import tables from a database to HDFS
import-mainframe Import datasets from a mainframe server to HDFS
job Work with saved jobs
list-databases List available databases on a server
list-tables List available tables in a database
merge Merge results of incremental imports
metastore Run a standalone Sqoop metastore
version Display version information
Check the version number
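The version command prints the installed Sqoop release and build information, for example:
[hadoop@xinxingdata001 app]$ sqoop version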
See how to use a command listed by sqoop help
[hadoop@xinxingdata001 software]$ sqoop help list-databases
usage: sqoop list-databases [GENERIC-ARGS] [TOOL-ARGS]
Common arguments:
--connect <jdbc-uri> Specify JDBC connect string
--driver <class-name> Manually specify JDBC driver class to use
--password <password> Set authentication password
--username <username> Set authentication username
[hadoop@xinxingdata001 software]$ sqoop list-databases \
--connect jdbc:mysql://192.168.66.66:3306 \
--password xinxingdata \
--username root
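Similarly, list-tables enumerates the tables in a database; note that the JDBC URL must include the database name in this case:
[hadoop@xinxingdata001 software]$ sqoop list-tables \
--connect jdbc:mysql://192.168.66.66:3306/xinxingdata \
--password xinxingdata \
--username root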
[hadoop@xinxingdata001 software]$ sqoop help import
usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]
Import control arguments:
--append Imports data in append mode
--as-textfile Imports data as plain text (default)
--columns <col,col,col...> Columns to import from table
--compression-codec <codec> Compression codec to use for import
--delete-target-dir Imports data in delete mode
-e,--query <statement> Import results of SQL 'statement'
-m,--num-mappers <n> Use 'n' map tasks to import in parallel
--split-by <column-name> Column of the table used to split work units
--mapreduce-job-name <name> Set name for generated mapreduce job
--table <table-name> Table to read
--target-dir <dir> HDFS plain table destination
--warehouse-dir <dir> HDFS parent for table destination
--where <where clause> WHERE clause to use during import
Output line formatting arguments:
--enclosed-by <char> Sets a required field enclosing character
--escaped-by <char> Sets the escape character
--fields-terminated-by <char> Sets the field separator character
--lines-terminated-by <char> Sets the end-of-line character
Hive arguments:
--create-hive-table Fail if the target hive table exists
--hive-database <database-name> Sets the database name to use when importing to hive
--hive-import Import tables into Hive (Uses Hive's default delimiters if none are set.)
--hive-overwrite Overwrite existing data in the Hive table
--hive-partition-key <partition-key> Sets the partition key to use when importing to hive
--hive-partition-value <partition-value> Sets the partition value to use when importing to hive
--hive-table <table-name> Sets the table name to use when importing to hive
--map-column-hive <arg> Override mapping for specific column to hive types.
Code generation arguments:
--null-non-string <null-str> Null non-string representation
--null-string <null-str> Null string representation
Import a MySQL table into HDFS from the command line
sqoop import \
--connect jdbc:mysql://192.168.66.66:3306/xinxingdata \
--password ********** \
--username root \
--table emp
If this step fails, download java-json.jar and copy it into $SQOOP_HOME/lib.
View the imported emp data
[hadoop@xinxingdata001 ~]$ hdfs dfs -text /user/hadoop/emp/part*
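The --columns and --where arguments from the help output above can narrow the import without writing a full query; a minimal sketch, assuming emp has empno, ename and sal columns (the column names are an assumption):
sqoop import \
--connect jdbc:mysql://192.168.66.66:3306/xinxingdata \
--password xinxingdata \
--username root \
--table emp \
--columns 'empno,ename,sal' \
--where 'sal > 1000' \
--delete-target-dir \
-m 1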
Import from MySQL into HDFS with a free-form query
sqoop import \
--connect jdbc:mysql://192.168.66.66:3306/xinxingdata \
--password xinxingdata \
--username root \
--mapreduce-job-name xinxing \
--target-dir emp \
--delete-target-dir \
--fields-terminated-by '\t' \
--null-non-string '0' \
--null-string ' ' \
--query 'select * from emp where sal >2000 and $CONDITIONS' \
-m 1
[hadoop@xinxingdata001 ~]$ hdfs dfs -text /user/hadoop/emp/p*
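If the free-form query import above is run with more than one mapper, Sqoop also needs --split-by to decide how to divide the rows between map tasks; a sketch assuming empno is a numeric key column (the column name is an assumption):
sqoop import \
--connect jdbc:mysql://192.168.66.66:3306/xinxingdata \
--password xinxingdata \
--username root \
--target-dir emp \
--delete-target-dir \
--query 'select * from emp where sal >2000 and $CONDITIONS' \
--split-by empno \
-m 2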
Passing Sqoop arguments from an options file
[hadoop@xinxingdata001 shell]$ vim sqoop_file
import
--connect
jdbc:mysql://192.168.66.66:3306/xinxingdata
--password
xinxingdata
--username
root
--target-dir
emp
--delete-target-dir
--fields-terminated-by
'\t'
--query
'select * from emp where sal >2000 and $CONDITIONS'
-m
1
[hadoop@xinxingdata001 shell]$ sqoop --options-file sqoop_file
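Sqoop can also import incrementally, pulling only rows whose check column is greater than a recorded last value; a minimal sketch assuming empno is a monotonically increasing key (the column name and last value are assumptions):
sqoop import \
--connect jdbc:mysql://192.168.66.66:3306/xinxingdata \
--password xinxingdata \
--username root \
--table emp \
--target-dir emp \
--incremental append \
--check-column empno \
--last-value 7499 \
-m 1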
Exporting from Hadoop to MySQL with Sqoop
[hadoop@xinxingdata001 ~]$ sqoop help export
usage: sqoop export [GENERIC-ARGS] [TOOL-ARGS]
Export control arguments:
...
--export-dir <dir> HDFS source path for the export
...
sqoop export \
--connect jdbc:mysql://localhost:3306/xinxingdata \
--password xinxingdata \
--username root \
--table emp_demo \
--mapreduce-job-name FromMySQL2HDFS \
--fields-terminated-by '\t' \
--export-dir /user/hadoop/emp \
-m 2
Query MySQL to verify that the export succeeded
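For example, a quick row count on the target table from the mysql client (assuming it is installed on the node):
mysql> select count(*) from xinxingdata.emp_demo;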
Hive import/export with Sqoop
usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]
Hive arguments:
--create-hive-table Fail if the target hive table exists
--hive-database <database-name> Sets the database name to use when importing to hive
--hive-drop-import-delims Drop Hive record \0x01 and row delimiters (\n\r) from imported string fields
--hive-import Import tables into Hive (Uses Hive's default delimiters if none are set.)
--hive-overwrite Overwrite existing data in the Hive table
--hive-partition-key <partition-key> Sets the partition key to use when importing to hive
--hive-partition-value <partition-value> Sets the partition value to use when importing to hive
--hive-table <table-name> Sets the table name to use when importing to hive
--map-column-hive <arg> Override mapping for specific column to hive types.
Import into a Hive table
The --create-hive-table flag is not recommended here, because it can produce wrong data types in the generated table; it is better to create the table schema in Hive first and then import the data (a sketch of that approach follows the error fix below).
sqoop import \
--connect jdbc:mysql://localhost:3306/xinxingdata \
--password xinxingdata \
--username root \
--table emp \
--create-hive-table \
--hive-table hive_emp \
--hive-import \
--delete-target-dir \
--mapreduce-job-name Xinxing2HDFS \
-m 2
The error "Make sure HIVE_CONF_DIR is set correctly." is caused by a missing Hive jar in Sqoop's lib directory; copying it over fixes it:
[hadoop@xinxingdata001 ~]$ cp ~/app/hive/lib/hive-exec-1.1.0-cdh5.16.2.jar ~/app/sqoop/lib/
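Following the recommendation above, a safer flow is to create the Hive table first and then import into it without --create-hive-table; a minimal sketch, assuming emp has empno, ename and sal columns (the table name and column list are assumptions):
[hadoop@xinxingdata001 ~]$ hive -e "create table hive_emp_pre (empno int, ename string, sal double) row format delimited fields terminated by '\t'"
sqoop import \
--connect jdbc:mysql://localhost:3306/xinxingdata \
--password xinxingdata \
--username root \
--table emp \
--hive-import \
--hive-table hive_emp_pre \
--hive-overwrite \
--fields-terminated-by '\t' \
--delete-target-dir \
-m 2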
Export from Hive to a MySQL table
sqoop export \
--connect jdbc:mysql://localhost:3306/xinxingdata \
--password xinxingdata \
--username root \
--export-dir /user/hive/warehouse/hive_emp \
--table emp_hive \
--fields-terminated-by '\t' \
--mapreduce-job-name Xinxing2HDFS \
-m 2
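One caveat: the delimiter passed to export must match how the files under the Hive warehouse directory are actually stored. If hive_emp was imported with Hive's default delimiters (no --fields-terminated-by, as in the import example above), the data is '\001'-delimited, and the export should declare that on the input side; a sketch assuming that case:
sqoop export \
--connect jdbc:mysql://localhost:3306/xinxingdata \
--password xinxingdata \
--username root \
--table emp_hive \
--export-dir /user/hive/warehouse/hive_emp \
--input-fields-terminated-by '\001' \
-m 2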