Big Data: Sqoop Installation and Usage


Sqoop Installation

Sqoop Overview

Sqoop (pronounced "skup") is an open-source tool mainly used to transfer data between Hadoop (Hive) and traditional databases (MySQL, PostgreSQL, ...). It can import data from a relational database (e.g., MySQL, Oracle, Postgres) into Hadoop's HDFS, and export data from HDFS back into a relational database.

Notes

Official site: sqoop.apache.org/

Download the appropriate version from the releases page.

Download address: www.apache.org/dyn/closer.…

Preparation

Upload sqoop-1.99.7-bin-hadoop200.tar.gz to the Linux machine and extract it into the installation directory /root/apps.
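The upload and extraction steps can be sketched as follows; the scp step and the hostname linux01 are assumptions, adjust them for your environment:

```shell
# From your local machine: copy the tarball to the server
# (hostname linux01 is an example)
scp sqoop-1.99.7-bin-hadoop200.tar.gz root@linux01:/root/apps/

# On the server: extract into the installation directory
cd /root/apps
tar -zxvf sqoop-1.99.7-bin-hadoop200.tar.gz
```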

[root@linux1 apps]# ll /root/apps
总用量 52
drwxr-xr-x.  3 root root  4096 4月  20 16:12 appdata
drwxr-xr-x. 10 1000 ftp   4096 2月  21 2010 hadoop-2.7.7
drwxr-xr-x.  8 uucp   143 4096 6月  22 2016 jdk1.8.0_101
drwxr-xr-x.  8 root root  4096 5月  20 21:28 sqoop-1.99.7-bin-hadoop200
[root@linux1 apps]# 

Configure Sqoop's environment variables and apply them:

[root@linux01 sqoop-1.99.7-bin-hadoop200]# vi /etc/profile
export SQOOP_HOME=/root/apps/sqoop-1.99.7-bin-hadoop200
export PATH=$PATH:$SQOOP_HOME/bin
export SQOOP_SERVER_EXTRA_LIB=$SQOOP_HOME/extra
export CATALINA_BASE=$SQOOP_HOME/server
export LOGDIR=$SQOOP_HOME/logs/
[root@master sqoop]# source /etc/profile

Create the required directories:

[root@master sqoop]# mkdir ${SQOOP_HOME}/extra
[root@master sqoop]# mkdir ${SQOOP_HOME}/logs

Edit the configuration file sqoop.properties

[root@master ~]# cd /root/apps/sqoop-1.99.7-bin-hadoop200/conf
[root@linux01 conf]# 
[root@linux01 conf]# ll
总用量 3
-rw-r--r--. 1 root root  668 5月  20 21:51 catalina.properties
-rw-r--r--. 1 root root 1584 7月  20 2016 sqoop_bootstrap.properties
-rw-r--r--. 1 root root 9731 5月  20 21:40 sqoop.properties
[root@linux01 conf]# 
[root@linux01 conf]# vi  sqoop.properties

Search for org.apache.sqoop.submission.engine.mapreduce.configuration.directory and set it to Hadoop's configuration directory:

org.apache.sqoop.submission.engine.mapreduce.configuration.directory=/root/apps/hadoop-2.7.7/etc/hadoop

Edit the catalina.properties file

Note: in the conf directory, edit catalina.properties and add the paths of your local Hadoop jars, as shown below:

[root@linux01 conf]# cd /root/apps/sqoop-1.99.7-bin-hadoop200/conf
[root@linux01 conf]# ll
总用量 20
-rw-r--r--. 1 root root  668 5月  20 21:51 catalina.properties
-rw-r--r--. 1 root root 1584 7月  20 2016 sqoop_bootstrap.properties
-rw-r--r--. 1 root root 9731 5月  20 21:40 sqoop.properties
[root@linux01 conf]# vi catalina.properties

Note: the only thing you need to change below is your Hadoop installation path.

common.loader=${catalina.base}/lib,${catalina.base}/lib/*.jar,${catalina.home}/lib,${catalina.home}/lib/*.jar,${catalina.home}/../lib/*.jar,/root/apps/hadoop-2.7.7/share/hadoop/common/*.jar,/root/apps/hadoop-2.7.7/share/hadoop/common/lib/*.jar,/root/apps/hadoop-2.7.7/share/hadoop/hdfs/*.jar,/root/apps/hadoop-2.7.7/share/hadoop/hdfs/lib/*.jar,/root/apps/hadoop-2.7.7/share/hadoop/mapreduce/*.jar,/root/apps/hadoop-2.7.7/share/hadoop/mapreduce/lib/*.jar,/root/apps/hadoop-2.7.7/share/hadoop/tools/lib/*.jar,/root/apps/hadoop-2.7.7/share/hadoop/yarn/*.jar,/root/apps/hadoop-2.7.7/share/hadoop/yarn/lib/*.jar,/root/apps/hadoop-2.7.7/share/hadoop/httpfs/tomcat/lib/*.jar

Edit Hadoop's core-site.xml file

Add the following:

<configuration>
  <property>
      <name>fs.defaultFS</name>
      <value>hdfs://linux01:9000</value>
  </property>
  <property>
      <name>hadoop.tmp.dir</name>
      <value>/root/apps/appdata</value>
  </property>

    <!-- configuration required by Sqoop -->
  <property>
    <name>hadoop.proxyuser.root.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.root.groups</name>
    <value>*</value>
  </property>
</configuration>
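The proxyuser settings only take effect after Hadoop is restarted. A minimal sketch, assuming the stock Hadoop 2.7.7 start/stop scripts:

```shell
cd /root/apps/hadoop-2.7.7
# Restart HDFS and YARN so the hadoop.proxyuser.* settings are picked up
sbin/stop-yarn.sh && sbin/stop-dfs.sh
sbin/start-dfs.sh && sbin/start-yarn.sh
```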

Sqoop Commands

Start the Sqoop server

[root@linux01 ~]# ${SQOOP_HOME}/bin/sqoop.sh server start
Setting conf dir: ./../conf
Sqoop home directory: /root/apps/sqoop-1.99.7-bin-hadoop200
Starting the Sqoop2 server...
1    [main] INFO  org.apache.sqoop.core.SqoopServer  - Initializing Sqoop server.
54   [main] INFO  org.apache.sqoop.core.PropertiesConfigurationProvider  - Starting config file poller thread
Sqoop2 server started.
[root@linux01 ~]# jps
1842 SecondaryNameNode
1653 NameNode
2278 SqoopJettyServer
2298 Jps
1998 ResourceManager
[root@linux01 ~]# 

Enter the Sqoop client

[root@linux01 bin]# ./sqoop.sh client
Setting conf dir: ./../conf
Sqoop home directory: /root/apps/sqoop-1.99.7-bin-hadoop200
Sqoop Shell: Type 'help' or '\h' for help.

sqoop:000> 

Or:

[root@linux01 bin]# ./sqoop2-shell
Setting conf dir: /root/apps/sqoop-1.99.7-bin-hadoop200/bin/../conf
Sqoop home directory: /root/apps/sqoop-1.99.7-bin-hadoop200
Sqoop Shell: Type 'help' or '\h' for help.

sqoop:000> 

Set the server connection (this tells the client which Sqoop2 server to talk to: its host, port, and webapp name):

set server --host linux01 --port 12000 --webapp sqoop

View and set the verbose option

Setting verbose to true produces more detailed output, which helps trace errors and debug.


sqoop:000> show option --name verbose
Verbose = false
sqoop:000> set option --name verbose --value true
Verbose option was changed to true
sqoop:000> show option --name verbose
Verbose = true
sqoop:000> 

Delete a link

sqoop:000> delete link -n linkName

List jobs

sqoop:000> show job 

Delete a job

sqoop:000> delete job -n jobName

List the available connectors

sqoop:002> show connector
+------------------------+---------+----------------------+----------------------+
|          Name          | Version |        Class         | Supported Directions |
+------------------------+---------+----------------------+----------------------+
| generic-jdbc-connector | 1.99.7  | GenericJdbcConnector | FROM/TO              |
| kite-connector         | 1.99.7  | KiteConnector        | FROM/TO              |
| oracle-jdbc-connector  | 1.99.7  | OracleJdbcConnector  | FROM/TO              |
| ftp-connector          | 1.99.7  | FtpConnector         | TO                   |
| hdfs-connector         | 1.99.7  | HdfsConnector        | FROM/TO              |
| kafka-connector        | 1.99.7  | KafkaConnector       | TO                   |
| sftp-connector         | 1.99.7  | SftpConnector        | TO                   |
+------------------------+---------+----------------------+----------------------+
sqoop:002> 

List the links that have been created

sqoop:002> show link
+--------+------------------------+---------+
|  Name  |     Connector Name     | Enabled |
+--------+------------------------+---------+
| mysql1 | generic-jdbc-connector | true    |
| mysql2 | generic-jdbc-connector | true    |
| hdfs1  | hdfs-connector         | true    |
| hdfs2  | hdfs-connector         | true    |
+--------+------------------------+---------+
sqoop:002> 

Create a MySQL link

Notes:

  1. Prerequisites: install MySQL and prepare table data (note: the table must have a primary key, or the job will fail), and grant access to all of the Hadoop worker nodes.
  2. Copy a MySQL JDBC driver jar into /root/apps/sqoop-1.99.7-bin-hadoop200/server/lib.
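The driver copy can be sketched as follows; the connector jar file name is an illustrative assumption, use whichever MySQL JDBC driver you downloaded:

```shell
# Copy the MySQL JDBC driver into the Sqoop server's classpath
# (mysql-connector-java-5.1.47.jar is an example file name)
cp mysql-connector-java-5.1.47.jar /root/apps/sqoop-1.99.7-bin-hadoop200/server/lib/
```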
sqoop:000> create link -c generic-jdbc-connector
Creating link for connector with name generic-jdbc-connector
Please fill following values to create new link object
Name: mysql2

Database connection

Driver class: com.mysql.jdbc.Driver
Connection String: jdbc:mysql://linux01:3306/test
Username: root
Password: ******
Fetch Size: 
Connection Properties: 
There are currently 0 values in the map:
entry# 

SQL Dialect

Identifier enclose:  
New link was successfully created with validation status OK and name mysql2
sqoop:000> 
sqoop:002> show link
+--------+------------------------+---------+
|  Name  |     Connector Name     | Enabled |
+--------+------------------------+---------+
| mysql2 | generic-jdbc-connector | true    |
+--------+------------------------+---------+
sqoop:002> 

Note: for the Identifier enclose prompt under SQL Dialect, enter a single space and then press Enter.

Create an HDFS link

sqoop:000> create link -c hdfs-connector 
Creating link for connector with name hdfs-connector
Please fill following values to create new link object
Name: hdfs2

HDFS cluster

URI: hdfs://linux01:9000
Conf directory: /root/apps/hadoop-2.7.7/etc/hadoop
Additional configs:: 
There are currently 0 values in the map:
entry# 
New link was successfully created with validation status OK and name hdfs2
sqoop:000> 
sqoop:000> show link
+--------+------------------------+---------+
|  Name  |     Connector Name     | Enabled |
+--------+------------------------+---------+
| mysql1 | generic-jdbc-connector | true    |
| hdfs2  | hdfs-connector         | true    |
+--------+------------------------+---------+

Create a job

create job -f fromLinkName -t toLinkName

Example:

The parenthesized annotations in the transcript below explain what to enter at each prompt.
sqoop:000> create job -f mysql-link -t hdfs-link 
Creating job for links with from name mysql-link and to name hdfs-link 
Please fill following values to create new job object 
Name: test1  (the job name)

Database source

Schema name: test  (the database schema name) 
Table name: user  (the database table name) 
SQL statement: 
Column names: 
There are currently 0 values in the list: 
element# 
Partition column: 
Partition column nullable: 
Boundary query:

Incremental read

Check column: 
Last value:

Target configuration

Override null value: 
Null value: 
File format: 
0 : TEXT_FILE 
1 : SEQUENCE_FILE 
2 : PARQUET_FILE 
Choose: 0  (choose TEXT_FILE) 
Compression codec: 
0 : NONE 
1 : DEFAULT 
2 : DEFLATE 
3 : GZIP 
4 : BZIP2 
5 : LZO 
6 : LZ4 
7 : SNAPPY 
8 : CUSTOM 
Choose: 0  (choose NONE) 
Custom codec: 
Output directory: /root/test  (an HDFS directory for the output; it must be empty) 
Append mode:

Throttling resources

Extractors: 2  (set to 2, following the official docs) 
Loaders: 2  (set to 2, following the official docs)

Classpath configuration

Extra mapper jars: 
There are currently 0 values in the list: 
element# 
New job was successfully created with validation status OK and name test1 
sqoop:000>

Start the job

sqoop:002> start job -name mysql2hdfs2
Submission details
Job Name: mysql2hdfs2
Server URL: http://localhost:12000/sqoop/
Created by: root
Creation date: 2020-05-21 20:37:41 CST
Lastly updated by: root
External ID: job_1590052132270_0001
        http://linux01:8088/proxy/application_1590052132270_0001/
Source Connector schema: Schema{name= test . emp ,columns=[
        FixedPoint{name=empno,nullable=true,type=FIXED_POINT,byteSize=4,signed=true},
        Text{name=ename,nullable=true,type=TEXT,charSize=null}]}
2020-05-21 20:37:41 CST: BOOTING  - Progress is not available
sqoop:002> 

Check HDFS

[root@linux03 hadoop]# hadoop fs -ls /a
20/05/21 20:47:28 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   3 root supergroup          0 2020-05-21 20:38 /a/768ac6af-c8c7-4a4f-9350-7c6f6df4ce31.txt
-rw-r--r--   3 root supergroup         22 2020-05-21 20:38 /a/ef9f9632-8120-4663-9d00-10bc1231a904.txt
[root@linux03 hadoop]# hadoop fs -cat /a/ef9f9632-8120-4663-9d00-10bc1231a904.txt
1,'zhangsan'
2,'lisi'
[root@linux03 hadoop]# 

MySQL to HDFS Import Example

Preparation

  1. Install Hadoop and start all of its services.

  2. Create a directory on HDFS to hold the imported data:

     hadoop fs -mkdir /sqoopinput

     sqoopinput is the directory being created.

  3. Install MySQL, grant access to the Hadoop nodes, and prepare a table with data.

  4. Install Sqoop 1.99.7.
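The MySQL preparation can be sketched like this; the user, password, and the emp table definition are illustrative assumptions matching the data seen later in this walkthrough:

```shell
# Grant access from the Hadoop nodes (the password is an example)
mysql -uroot -p -e "GRANT ALL PRIVILEGES ON test.* TO 'root'@'%' IDENTIFIED BY 'root'; FLUSH PRIVILEGES;"

# Prepare a sample table; note that it must have a primary key
mysql -uroot -p test -e "CREATE TABLE emp (empno INT PRIMARY KEY, ename VARCHAR(50));
INSERT INTO emp VALUES (1, 'zhangsan'), (2, 'lisi');"
```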

Import Steps

  1. Start the Sqoop server
[root@linux01 server]# cd ${SQOOP_HOME}/bin
[root@linux01 bin]# 
[root@linux01 bin]# 
[root@linux01 bin]# pwd
/root/apps/sqoop-1.99.7-bin-hadoop200/bin
[root@linux01 bin]# ll
总用量 32
drwxr-xr-x. 3 root root 4096 5月  21 14:59 @BASEDIR@
drwxr-xr-x. 2 root root 4096 5月  21 14:58 @LOGDIR@
-rwxr--r--. 1 root root 1175 7月  20 2016 sqoop2-server
-rwxr--r--. 1 root root 1166 7月  20 2016 sqoop2-shell
-rwxr--r--. 1 root root 1163 7月  20 2016 sqoop2-tool
-rwxr--r--. 1 root root 7171 7月  20 2016 sqoop.sh
-rwxr--r--. 1 root root  942 7月  20 2016 sqoop-sys.sh
[root@linux01 bin]#
[root@linux01 bin]# ./sqoop.sh server start
Setting conf dir: ./../conf
Sqoop home directory: /root/apps/sqoop-1.99.7-bin-hadoop200
Starting the Sqoop2 server...
0    [main] INFO  org.apache.sqoop.core.SqoopServer  - Initializing Sqoop server.
102  [main] INFO  org.apache.sqoop.core.PropertiesConfigurationProvider  - Starting config file poller thread
Sqoop2 server started.
[root@linux01 bin]# 
  2. Enter the Sqoop client
[root@linux01 bin]# ./sqoop2-shell
Setting conf dir: /root/apps/sqoop-1.99.7-bin-hadoop200/bin/../conf
Sqoop home directory: /root/apps/sqoop-1.99.7-bin-hadoop200
Sqoop Shell: Type 'help' or '\h' for help.

sqoop:000> 
  3. Set the server (linux01 is the Hadoop master node)

sqoop:000> set server --host linux01 --port 12000 --webapp sqoop
Server is set successfully
sqoop:000> 
  4. Set the log verbosity
sqoop:000> show option --name verbose
Verbose = false
sqoop:000> set option --name verbose --value true
Verbose option was changed to true
sqoop:000> show option --name verbose
Verbose = true
sqoop:000> 
  5. Create the MySQL link
sqoop:000> create link -c generic-jdbc-connector
Creating link for connector with name generic-jdbc-connector
Please fill following values to create new link object
Name: mysqllink3

Database connection

Driver class: com.mysql.jdbc.Driver
Connection String: jdbc:mysql://linux01:3306/test
Username: root
Password: ******
Fetch Size: 
Connection Properties: 
There are currently 0 values in the map:
entry# 

SQL Dialect

Identifier enclose:  
New link was successfully created with validation status WARNING and name mysqllink3
sqoop:000> 
  6. Create the HDFS link
sqoop:000>  create link -c hdfs-connector 
Creating link for connector with name hdfs-connector
Please fill following values to create new link object
Name: hdfslink3

HDFS cluster

URI: hdfs://linux01:9000
Conf directory: /root/apps/hadoop-2.7.7/etc/hadoop
Additional configs:: 
There are currently 0 values in the map:
entry# 
New link was successfully created with validation status OK and name hdfslink3
sqoop:000> show link
+------------+------------------------+---------+
|    Name    |     Connector Name     | Enabled |
+------------+------------------------+---------+
| mysqllink3 | generic-jdbc-connector | true    |
| hdfslink3  | hdfs-connector         | true    |
+------------+------------------------+---------+
sqoop:000> 
  7. Create the job
sqoop:000> create job -f mysqllink3 -t hdfslink3
Creating job for links with from name mysqllink3 and to name hdfslink3
Please fill following values to create new job object
Name: mysqltohdfs3

Database source

Schema name: test
Table name: emp
SQL statement: 
Column names: 
There are currently 0 values in the list:
element# 
Partition column: 
Partition column nullable: 
Boundary query: 

Incremental read

Check column: 
Last value: 

Target configuration

Override null value: 
Null value: 
File format: 
  0 : TEXT_FILE
  1 : SEQUENCE_FILE
  2 : PARQUET_FILE
Choose: 0
Compression codec: 
  0 : NONE
  1 : DEFAULT
  2 : DEFLATE
  3 : GZIP
  4 : BZIP2
  5 : LZO
  6 : LZ4
  7 : SNAPPY
  8 : CUSTOM
Choose: 0
Custom codec: 
Output directory: /sqoopinput
Append mode: 

Throttling resources

Extractors: 2
Loaders: 2

Classpath configuration

Extra mapper jars: 
There are currently 0 values in the list:
element# 
New job was successfully created with validation status OK  and name mysqltohdfs3
sqoop:000> 
  8. Start the job

Note: after starting a job, it is best to watch its progress in YARN to confirm it has completed; if the job has not finished yet, HDFS may appear empty when you check it.

You can also skip YARN and simply wait a while before checking HDFS.

sqoop:000>  start job -name mysqltohdfs3
Submission details
Job Name: mysqltohdfs3
Server URL: http://localhost:12000/sqoop/
Created by: root
Creation date: 2020-05-22 10:27:11 CST
Lastly updated by: root
External ID: job_1590107885418_0001
        http://linux01:8088/proxy/application_1590107885418_0001/
Source Connector schema: Schema{name= test . emp ,columns=[
        FixedPoint{name=empno,nullable=true,type=FIXED_POINT,byteSize=4,signed=true},
        Text{name=ename,nullable=true,type=TEXT,charSize=null}]}
2020-05-22 10:27:11 CST: BOOTING  - Progress is not available
sqoop:000> 
  9. Check HDFS to verify the migration succeeded
[root@linux01 bin]# hadoop fs  -ls /sqoopinput
-rw-r--r--   3 root supergroup         22 2020-05-22 10:28 /sqoopinput/02d7eab2-cf84-4b46-a134-adaa4dc11f4a.txt
-rw-r--r--   3 root supergroup          0 2020-05-22 10:28 /sqoopinput/d8142f84-2a12-427f-affd-0c59287a15cb.txt
[root@linux01 bin]# 
[root@linux01 bin]# hadoop fs -cat /sqoopinput/02d7eab2-cf84-4b46-a134-adaa4dc11f4a.txt
1,'zhangsan'
2,'lisi'
[root@linux01 bin]# 

Pitfalls Along the Way

Creating links and jobs is a multi-step interactive dialogue, and if you enter one parameter incorrectly you cannot go back and fix it. The best workaround is to create a new link; it cannot share a name with the old one, otherwise you get a "link already exists" error even though show link does not display it, which is frustrating. If you exit the client, stop the Sqoop server, and then recreate everything, the problem goes away.
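The workaround described above boils down to restarting the server and recreating the objects; a sketch, assuming the environment variables set earlier:

```shell
# Exit the client first (type 'exit' in the shell), then restart the server
${SQOOP_HOME}/bin/sqoop.sh server stop
${SQOOP_HOME}/bin/sqoop.sh server start

# Re-enter the client and recreate the link under a new name
${SQOOP_HOME}/bin/sqoop2-shell
```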

For other courses, see juejin.cn/user/175884…