hive-开窗函数与自定义函数开窗函数基本语法：函数(col_name) over (partition by 字段

开窗函数

基本语法：函数(col_name) over (partition by 字段 order by 字段 rows between 起始行 and 中止行)

函数：min，max，sum等

窗口范围

partition by 字段名 对数据分区，类似group by，默认是所有行
order by 字段名 对分区后的数据排序
窗口范围

  （1）窗口起点（可以不包含当前行）：
  （1）unbounded preceding：无边界的，可以理解为负无穷，表示表起点的第一行。
  （2）[num] preceding：num为数字，表示当前行的前 num 行作为起点。
  （3）current row：起点为当前行。
  （4）[num] following：表示当前行的后 num 行作为起点。

  （2）窗口终点（取决于起点方式）：
  （1）[num] preceding：当前行的前 num 行作为终点；当起点和终点都用 [num] preceding 时，起点的num值要比终点的大，保证终点在起点后面。
  （2）current row：当前行作为终点。
  （3）[num] following：当前行的后 num 行作为终点；当起点和终点都用 [num] following 时，起点的num值要小于终点的num值，保证终点在起点后面。
  （4）unbounded following：相当于正无穷，表示表里的最后一行作为终点。

例子： 查询截至到当前日期的该商品的销量和

select
    shop_id,
    sum(shop_cnt) over (partition by shop_id order by buy_time rows between unbounded preceding and current row) sum_cnt
from
    shop;

常用开窗函数

lag：取当前行的上n行，lag(字段名, 数字：上面n行, 默认值]) over(partition by 分区字段 order by 排序字段)
lead：取当前行的下n行，lead(字段名, 数字：下面n行, 默认值]) over(partition by 分区字段 order by 排序字段)
first_value：取窗口的第一个元素 first_value(字段名, false：是否跳过null值)
last_value：取窗口的最后一元素 last_value(字段名, false：是否跳过null值)

自定义函数（Java代码编写）

第一步：导入依赖

<dependencies>
   <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-exec</artifactId>
      <version>3.1.3</version>
   </dependency>
</dependencies>

第二部继承GenericUDF实现抽象类方法

public class MyUDF extends GenericUDF {
    /**
     * 判断传进来的参数的类型和长度
     * 约定返回的数据类型
     */
    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {

        if (arguments.length !=1) {
            throw  new UDFArgumentLengthException("please give me  only one arg");
        }

        if (!arguments[0].getCategory().equals(ObjectInspector.Category.PRIMITIVE)){
            throw  new UDFArgumentTypeException(1, "i need primitive type arg");
        }

        return PrimitiveObjectInspectorFactory.javaIntObjectInspector;
    }

    /**
     * 解决具体逻辑的
     */
    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
           
        // get方法才能获取到值
        Object o = arguments[0].get();
        if(o==null){
            return 0;
        }

        return o.toString().length();
    }

    @Override
    // 用于获取解释的字符串，可以不用实现
    public String getDisplayString(String[] children) {
        return "";
    }
}

第三步：打包上传到hdfs中第四步：创建函数

create function [database_nam.]my_len2 -- 函数名称
as "com.atguigu.hive.udf.MyUDF" -- 全类名
using jar "hdfs://hadoop102:8020/udf/myudf.jar";  -- jar包路径

注：创建的函数是指定库的，如果没有指定则是当前库，该函数只能在该库下使用

分区

在建表可以指定[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]，每一个分区在hadfs中都是对应一个文件夹，在查询时可以使用该字段进行过滤，如下建表例子

create table test(
    id int,
    name string
) partitioned by (age int)  -- 字段age作为分区字段

加载数据也需要指定分区

-- 例子1
load data local inpath '/opt/module/hive/datas/dept_20220401.log'
into table dept_partition
partition(day='20220401') ;

-- 例子2
insert overwrite table dept_partition **partition (day = '20220402')**
select deptno, dname, loc
from dept_partition
where day = '2020-04-01';

添加分区

-- 添加几个就写几个
alter table dept_partition
add partition(day='20220404') partition(day='20220405');

删除分区

alter table dept_partition
drop partition (day='20220404'), partition(day='20220405');

多级分区 多级分区类似于多级文件夹

load data local inpath '/opt/module/hive/datas/dept_20220401.log'
into table dept_partition2
partition(day='20220401', hour='12'), partition(day='20220401', hour='13');
-- 表示有一个分区day='20220401'，里面又包含两个分区hour='12'，hour='13'

动态分区

设置动态分区：set hive.exec.dynamic.partition=true，默认是false关闭的
严格模式动态分区：
- 动态分区的模式，默认strict（严格模式），要求必须指定至少一个分区为静态分区，nonstrict（非严格模式）允许所有的分区字段都使用动态分区。 set hive.exec.dynamic.partition.mode=nonstrict
一条insert语句可同时创建的最大的分区个数，默认为1000。

set hive.exec.max.dynamic.partitions=1000
单个Mapper或者Reducer可同时创建的最大的分区个数，默认为100。

set hive.exec.max.dynamic.partitions.pernode=100
一条insert语句可以创建的最大的文件个数，默认100000。

hive.exec.max.created.files=100000
当查询结果为空时且进行动态分区时，是否抛出异常，默认false。

hive.error.on.empty.partition=false

最后n个字段要分别对应hive分区的字段

set hive.exec.dynamic.partition.mode = nonstrict;
hive (default)>
insert into table dept_partition_dynamic
partition(loc)
select
    deptno,
    dname,
    loc
from dept;

分桶表

hive (default)>
create table stu_buck(
    id int,
    name string
)
clustered by(id)
into 4 buckets
row format delimited fields terminated by '\t';

分桶排序表

hive (default)>
create table stu_buck_sort(
    id int,
    name string
)
clustered by(id) sorted by(id)
into 4 buckets
row format delimited fields terminated by '\t';