Overview
The Apache Hive data warehouse software makes it easy to read, write, and manage large datasets residing in distributed storage using SQL. Hive maps structured data files to database tables and provides SQL query capability, translating SQL statements into MapReduce jobs for execution. A command-line tool and a JDBC driver are provided to connect users to Hive. The project was open-sourced by Facebook to solve the problem of analyzing massive amounts of structured data, and later graduated to become a top-level Apache project.
Installation and Deployment
1. Prerequisites
Before deploying Hive, make sure your JDK, MySQL, and Hadoop components are working. Deployment guides for each are provided separately.
2. Prepare the tarball and the JDBC driver
- hive-1.1.0-cdh5.16.2.tar.gz (download link; extraction code: 87ew)
- mysql-connector-java-5.1.47.jar
[hadoop@xinxingdata001 software]$ tar -zxvf hive-1.1.0-cdh5.16.2.tar.gz -C ../app/
[hadoop@xinxingdata001 software]$ cd ../app/
[hadoop@xinxingdata001 app]$ ln -s hive-1.1.0-cdh5.16.2/ hive
3. Set the Hive environment variables, edit the conf files, and copy the JDBC driver
Configure the environment variables
[hadoop@xinxingdata001 ~]$ vim ~/.bashrc
#set hive environment
export HIVE_HOME=/home/hadoop/app/hive
export PATH=$HIVE_HOME/bin:$PATH
[hadoop@xinxingdata001 ~]$ source ~/.bashrc
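As a quick sanity check, the two exports above can be replayed in the current shell and verified before moving on (paths match the layout used in this guide):

```shell
# Re-create the two exports from ~/.bashrc and confirm that Hive's
# bin directory is actually on PATH.
export HIVE_HOME=/home/hadoop/app/hive
export PATH=$HIVE_HOME/bin:$PATH

if echo ":$PATH:" | grep -q ":$HIVE_HOME/bin:"; then
  on_path=yes
else
  on_path=no
fi
echo "HIVE_HOME=$HIVE_HOME, hive bin on PATH: $on_path"
```

Once the tarball is unpacked and the symlink created, `which hive` should then resolve to /home/hadoop/app/hive/bin/hive.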
Edit the configuration files
[hadoop@xinxingdata001 ~]$ vim /home/hadoop/app/hive/conf/hive-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://xinxingdata:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
<description>the URL of the MySQL database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
</property>
</configuration>
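One easy-to-miss detail: hive-site.xml is parsed as XML, so the `&` joining the JDBC URL parameters must be escaped as `&amp;` inside the `<value>` element; a bare `&` makes the file malformed and Hive will fail to read its configuration at startup. A minimal self-check, using Python's standard-library XML parser on a scratch copy of the property:

```shell
# Write a scratch copy of the ConnectionURL property and parse it back;
# the XML parser decodes &amp; into a plain & in the final URL.
cat > /tmp/hive-site-check.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://xinxingdata:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
  </property>
</configuration>
EOF
url=$(python3 -c "
import xml.etree.ElementTree as ET
root = ET.parse('/tmp/hive-site-check.xml').getroot()
print(root.find('./property/value').text)
")
echo "$url"
```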
[hadoop@xinxingdata001 conf]$ cp hive-env.sh.template hive-env.sh
[hadoop@xinxingdata001 conf]$ vim hive-env.sh
# Set HADOOP_HOME to point to a specific hadoop install directory
HADOOP_HOME=/home/hadoop/app/hadoop
Hive configuration reference: cwiki.apache.org/confluence/…
Copy the MySQL driver JAR
[hadoop@xinxingdata001 software]$ cp mysql-connector-java-5.1.47.jar ../app/hive/lib/mysql-connector-java.jar
Start the metastore and hiveserver2 services
[hadoop@xinxingdata001 software]$ mkdir -p ~/log
[hadoop@xinxingdata001 software]$ nohup hive --service metastore > ~/log/metastore.log 2>&1 &
[hadoop@xinxingdata001 software]$ nohup hiveserver2 > ~/log/hiveserver2.log 2>&1 &
Common Commands, Basics, and Exercises
Common commands
hive> !clear; // clear the screen
hive> exit; // exit the CLI
hive> show databases; // list databases
hive> show databases like '*student*'; // list databases whose names contain "student" (SHOW ... LIKE uses * as the wildcard, not %)
hive> create database IF NOT EXISTS xx_hive; // create a database named xx_hive
hive> create table student(id int, name string, age int); // create a table
hive> use xx_hive; // switch to the xx_hive database
hive> set hive.cli.print.current.db; // print the current value of a setting
hive> set hive.cli.print.current.db=true; // show the current database name in the CLI prompt
hive> set hive.cli.print.header=true; // print column names in query output
hive> desc formatted student; // show the table's structure in detail
hive (default)> create database if not exists hive2 location '/hive/directory'; // create a database with an explicit HDFS storage path
hive (hive2)> drop database if exists hive2 cascade; // drop a database; cascade also drops the tables inside it (use with care)
hive (default)> select current_database(); // show the current database
hive> ALTER TABLE xinxing_emp2 RENAME TO xinxing_emp2_new; // rename a table
// Replacing DROP with TRUNCATE in a statement deletes only the table's data and keeps its structure.
MSCK REPAIR TABLE emp_partition [ADD/DROP/SYNC PARTITIONS]; // sync the metastore's partition list with HDFS
hive> set hive.fetch.task.conversion=none|minimal|more; // controls which simple queries skip MapReduce (valid values: none, minimal, more)
hive> set hive.execution.engine=mr|spark; // choose the execution engine (mr or spark)
hive (default)> select explode(col) from tbls; // expand an array (or map) column into rows
hive (default)> select split(col, 'delimiter') from tbls; // split a string into an array
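For intuition about the last two functions: split turns a delimited string into an array, and explode turns that array into one row per element. The same two steps, sketched outside Hive in plain shell (the sample string is made up):

```shell
# split("hadoop,hive,spark", ",") -> ["hadoop","hive","spark"]
# explode([...])                  -> one row per element
value="hadoop,hive,spark"
rows=$(echo "$value" | tr ',' '\n')      # comma-split, one element per line
echo "$rows"
row_count=$(echo "$rows" | grep -c '')
echo "row_count=$row_count"
```

In Hive itself the two are usually combined: `select explode(split(col, ',')) from tbls;`.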
Running SQL from the command line
[hadoop@xinxingdata001 conf]$ hive -e "select * from student";
[hadoop@xinxingdata001 conf]$ hive -f <sql_file>;
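`hive -f` simply runs, in order, whatever statements the file contains — the same effect as passing them all to `hive -e`. A hypothetical script file is built below; the `hive -f` call itself is left commented since it needs a running cluster:

```shell
# Build a small SQL script; each statement ends with a semicolon.
cat > /tmp/query.sql <<'EOF'
use xx_hive;
set hive.cli.print.header=true;
select * from student;
EOF
# hive -f /tmp/query.sql    # would run the three statements in order
grep -c '' /tmp/query.sql   # number of statements/lines written
```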
Common data types and delimiters
Numeric types: int, float, double, bigint
String types: string, varchar, char
Boolean type: boolean
Complex types: array, map, struct
Common delimiters: "," and "\t"
Creating tables
CREATE [EXTERNAL] TABLE table_name
[ (列名 数据类型 [column_constraint_specification] [COMMENT col_comment], ...[constraint_specification]) ]
[ROW FORMAT row_format]
[AS select_statement]
[LIKE existing_table_or_view_name]
[LOCATION hdfs_path];
*******************************************************************************************************************
EXTERNAL
// create an external table (Hive does not delete its data on DROP)
column_constraint_specification
:[ PRIMARY KEY|UNIQUE|NOT NULL|DEFAULT [default_value]|CHECK [check_expression] ENABLE|DISABLE NOVALIDATE RELY/NORELY ]
COMMENT col_comment
// attach a comment to the column
constraint_specification
: [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ]
[, CONSTRAINT constraint_name FOREIGN KEY (col_name, ...) REFERENCES table_name(col_name, ...) DISABLE NOVALIDATE ]
[, CONSTRAINT constraint_name UNIQUE (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ]
[, CONSTRAINT constraint_name CHECK [check_expression] ENABLE|DISABLE NOVALIDATE RELY/NORELY ]
row_format
: DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
// field and collection-item delimiters
AS select_statement
// populate the new table with the rows returned by the query (CTAS)
LIKE existing_table_or_view_name
// copy the schema (but not the data) of existing_table_or_view_name
LOCATION hdfs_path
// store the table under hdfs_path; files already at that path become the table's data
emp.txt
7369 SMITH CLERK 7902 1980-12-17 800.00 20
7499 ALLEN SALESMAN 7698 1981-2-20 1600.00 300.00 30
7521 WARD SALESMAN 7698 1981-2-22 1250.00 500.00 30
7566 JONES MANAGER 7839 1981-4-2 2975.00 20
7654 MARTIN SALESMAN 7698 1981-9-28 1250.00 1400.00 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.00 30
7782 CLARK MANAGER 7839 1981-6-9 2450.00 10
7788 SCOTT ANALYST 7566 1987-4-19 3000.00 20
7839 KING PRESIDENT 1981-11-17 5000.00 10
7844 TURNER SALESMAN 7698 1981-9-8 1500.00 0.00 30
7876 ADAMS CLERK 7788 1987-5-23 1100.00 20
7900 JAMES CLERK 7698 1981-12-3 950.00 30
7902 FORD ANALYST 7566 1981-12-3 3000.00 20
7934 MILLER CLERK 7782 1982-1-23 1300.00 10
8888 HIVE PROGRAM 7839 1988-1-23 10300.00
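Before loading emp.txt into Hive it is worth a quick shell-level sanity check. The snippet below recreates the rows above (with single spaces standing in for the tab delimiters of the real file, which is enough for counting):

```shell
# Recreate emp.txt and check row counts before loading it into Hive.
cat > /tmp/emp.txt <<'EOF'
7369 SMITH CLERK 7902 1980-12-17 800.00 20
7499 ALLEN SALESMAN 7698 1981-2-20 1600.00 300.00 30
7521 WARD SALESMAN 7698 1981-2-22 1250.00 500.00 30
7566 JONES MANAGER 7839 1981-4-2 2975.00 20
7654 MARTIN SALESMAN 7698 1981-9-28 1250.00 1400.00 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.00 30
7782 CLARK MANAGER 7839 1981-6-9 2450.00 10
7788 SCOTT ANALYST 7566 1987-4-19 3000.00 20
7839 KING PRESIDENT 1981-11-17 5000.00 10
7844 TURNER SALESMAN 7698 1981-9-8 1500.00 0.00 30
7876 ADAMS CLERK 7788 1987-5-23 1100.00 20
7900 JAMES CLERK 7698 1981-12-3 950.00 30
7902 FORD ANALYST 7566 1981-12-3 3000.00 20
7934 MILLER CLERK 7782 1982-1-23 1300.00 10
8888 HIVE PROGRAM 7839 1988-1-23 10300.00
EOF
total=$(grep -c '' /tmp/emp.txt)
salesmen=$(grep -c 'SALESMAN' /tmp/emp.txt)
echo "rows=$total salesmen=$salesmen"
```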
Create a table whose field delimiter is \t
create table xinxing_emp(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int
)ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
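`ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'` makes Hive split every line of the file on tabs and assign the pieces, in order, to the columns declared above. The same split, sketched with awk on one row — assuming the real file encodes SMITH's missing comm as an empty field between two consecutive tabs:

```shell
# One tab-separated row of emp.txt; \t\t marks the empty comm field.
line=$(printf '7369\tSMITH\tCLERK\t7902\t1980-12-17\t800.00\t\t20')
ename=$(printf '%s\n'  "$line" | awk -F'\t' '{print $2}')
sal=$(printf '%s\n'    "$line" | awk -F'\t' '{print $6}')
comm=$(printf '%s\n'   "$line" | awk -F'\t' '{print $7}')
deptno=$(printf '%s\n' "$line" | awk -F'\t' '{print $8}')
echo "ename=$ename sal=$sal comm=<$comm> deptno=$deptno"
```

The empty 7th field is what Hive would read as NULL for the numeric comm column.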
Create a table copying another table's schema
create table xinxing_emp2 like xinxing_emp;
Specify the HDFS data path when creating a table
CREATE TABLE tbl_name like xinxing_emp LOCATION '/user/hive/warehouse/xinxing_emp';
Create a table copying another table's schema and data
create table xinxing_emp3 as select * from xinxing_emp;
Loading data into a table
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
With [LOCAL] the filepath is a local path; without it, an HDFS path. With [OVERWRITE] existing data is replaced; without it, data is appended.
Load local data into a Hive table
LOAD DATA LOCAL INPATH '/home/hadoop/data/emp.txt' OVERWRITE INTO TABLE xinxing_emp;
Insert query results into a Hive table
Basic syntax:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
> OVERWRITE replaces any existing data; INTO appends to it
Extended syntax (multi-table insert):
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
Write query results out to a local (or HDFS) directory
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 [ROW FORMAT row_format] [STORED AS file_format] SELECT ... FROM ...
INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/data/tmp/' select empno,ename from xinxing_emp;
Insert rows returned by a SQL query into a Hive table
insert into table emp2 select empno,ename,job,mgr,hiredate,sal,comm,deptno from xinxing_emp;
Looking up built-in functions
hive (default)> show functions; // list all built-in functions
hive (default)> desc function [extended] lower; // show a function's usage, with examples when extended