Hive from Getting Started to Giving Up — An Introduction to UDFs, Hive User-Defined Functions (Part 12)


==Background==

  Hive ships with a powerful set of built-in functions for data processing and analysis, but some requirement-specific logic, or frequently reused processing logic, is very awkward or even impossible to express with built-in functions alone. That is where Hive user-defined functions come in. They fall into three broad categories:

  • UDF (user-defined function): takes one value in, applies some logic, and returns one value, in the same way as built-ins such as floor and round;
  • UDAF (user-defined aggregation function): takes multiple rows in and, after a group by on the chosen keys, returns one row per group, like sum and count;
  • UDTF (user-defined table-generating function): takes a single input row and expands it into multiple output rows, like the built-in explode;

  Steps to build a UDF:

  1. Extend the relevant base class (UDF, UDAF, or UDTF) or implement the corresponding interface, and write the method(s) that implement your logic.
  2. Package the class into a jar and, preferably, upload the jar to HDFS.
  3. In the Hive client, create your UDF with CREATE FUNCTION db_name.your_function_name AS 'your_jar_package_name.your_jar_class_name' USING JAR 'hdfs:///your_jar_hdfs_path/your_jar.jar';

db_name: the Hive database to register the function in; your_function_name: the name you give the function; your_jar_package_name: the package of the Java class implementing the UDF; your_jar_class_name: the name of that class; hdfs:///your_jar_hdfs_path/your_jar.jar: the path of the jar on HDFS; everything else: Hive keywords for creating a function.

  4. In the Hive client, run select your_function_name(cols) to verify the function works.

  This post covers the most commonly used of the three, the UDF. On to the hands-on part.

==Hive UDF in Practice==

==Requirement Background==

  I have a batch of device data with an aspect-ratio field holding values such as 16:9 or 4:3. Because front-end collection was not done carefully, both 16:9 and 9:16 show up for the same ratio, which makes the data very awkward to use. The warehouse team therefore decided to clean the data with a fixed rule: every device's aspect ratio is written as larger:smaller, so only values like 16:9 and 4:3 remain.

==Java Implementation==

  I prefer writing the Java in IDEA; as Figure 1 shows, the UDF is implemented as a Maven-based Java project.

Figure 1: Java code implementing the UDF logic

  The Java code implementing the logic is shown below; it is fairly simple and should need little explanation.

package org.example;

import org.apache.hadoop.hive.ql.exec.UDF;

public class PicRatio_UDF extends UDF {
    // Normalize an aspect ratio such as "9:16" to "16:9" (larger value first).
    public String evaluate(String picratio)
    {
        try
        {
            String[] myArray = picratio.split(":");
            int myArray0 = Integer.parseInt(myArray[0]);
            int myArray1 = Integer.parseInt(myArray[1]);
            // Swap the two numbers when the first is smaller; otherwise return as-is.
            return myArray0 < myArray1 ? myArray1 + ":" + myArray0 : picratio;
        }
        catch (Exception e)
        {
            // Malformed or null input: log the exception and pass the value through unchanged.
            System.out.println(e);
            return picratio;
        }
    }
}

  To be rigorous, let's also write a main method to verify that the Java logic is correct; the test code is below.

package org.example;

public class Test_UDF {
    public static void main(String[] args)
    {
        String source_PicRatio="3:4";
        PicRatio_UDF myPicRatio_UDF=new PicRatio_UDF();
        String myPicRatio=myPicRatio_UDF.evaluate(source_PicRatio);
        System.out.println(myPicRatio);
    }
}
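Because evaluate is plain Java, the same logic can also be exercised against edge cases without hive-exec on the classpath. The sketch below re-implements the normalization in a standalone class (the class and method names are mine for illustration, not part of the original project) and checks the swapped, already-ordered, and malformed cases:

```java
// Standalone copy of the evaluate() logic so edge cases can be tested
// without any Hive dependency (illustrative only).
public class PicRatioLogicCheck {
    static String normalize(String picratio) {
        if (picratio == null) return null;          // null-safe, like the catch branch
        try {
            String[] parts = picratio.split(":");
            int a = Integer.parseInt(parts[0]);
            int b = Integer.parseInt(parts[1]);
            // Always emit larger:smaller.
            return a < b ? b + ":" + a : picratio;
        } catch (Exception e) {
            // Malformed input passes through unchanged.
            return picratio;
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("3:4"));   // 4:3  (swapped)
        System.out.println(normalize("16:9"));  // 16:9 (already larger:smaller)
        System.out.println(normalize("abc"));   // abc  (malformed, unchanged)
    }
}
```

Cases like these (plus a null input) are exactly what is worth checking before packaging, since a jar rebuild and re-registration cycle is slow.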

  The project uses Maven; the dependencies and plugins it needs are declared in pom.xml, shown below.

<?xml version="1.0" encoding="UTF-8"?>

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>org.example</groupId>
  <artifactId>udf</artifactId>
  <version>1.0-SNAPSHOT</version>

  <name>udf</name>
  <!-- FIXME change it to the project's website -->
  <url>http://www.example.com</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.7</maven.compiler.source>
    <maven.compiler.target>1.7</maven.compiler.target>
  </properties>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.8.5</version>
    </dependency>

    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-common</artifactId>
      <version>2.3.5</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-exec -->
    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-exec</artifactId>
      <version>2.3.5</version>
    </dependency>

  </dependencies>
  
<!-- Required; without it the build tends to fail with "Failure to find org.pentaho:pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde" -->
  <repositories>
    <repository>
      <id>spring-plugin</id>
      <url>https://repo.spring.io/plugins-release/</url>
    </repository>
  </repositories>


  <build>
    <pluginManagement><!-- lock down plugins versions to avoid using Maven defaults (may be moved to parent pom) -->
      <plugins>
        <!-- clean lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#clean_Lifecycle -->
        <plugin>
          <artifactId>maven-clean-plugin</artifactId>
          <version>3.1.0</version>
        </plugin>
        <!-- default lifecycle, jar packaging: see https://maven.apache.org/ref/current/maven-core/default-bindings.html#Plugin_bindings_for_jar_packaging -->
        <plugin>
          <artifactId>maven-resources-plugin</artifactId>
          <version>3.0.2</version>
        </plugin>
        <plugin>
          <artifactId>maven-compiler-plugin</artifactId>
          <version>3.8.0</version>
        </plugin>
        <plugin>
          <artifactId>maven-surefire-plugin</artifactId>
          <version>2.22.1</version>
        </plugin>
        <plugin>
          <artifactId>maven-jar-plugin</artifactId>
          <version>3.0.2</version>
        </plugin>
        <plugin>
          <artifactId>maven-install-plugin</artifactId>
          <version>2.5.2</version>
        </plugin>
        <plugin>
          <artifactId>maven-deploy-plugin</artifactId>
          <version>2.8.2</version>
        </plugin>
        <!-- site lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#site_Lifecycle -->
        <plugin>
          <artifactId>maven-site-plugin</artifactId>
          <version>3.7.1</version>
        </plugin>
        <plugin>
          <artifactId>maven-project-info-reports-plugin</artifactId>
          <version>3.0.0</version>
        </plugin>
      </plugins>
    </pluginManagement>
    <plugins>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
          <archive>
            <manifest>
              <mainClass></mainClass>
            </manifest>
          </archive>
        </configuration>
        <executions>
          <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>

  Once everything is ready, build the jar with Maven (for packaging steps, see the post "IntelliJ IDEA将代码打成Jar包的方式"). Make sure the packaging log is clean and error-free; do not assume the jar is good just because a jar file was produced. If you skip the log and copy out a broken jar, Hive will still fail when it references the jar, and you will have to fix the build and repackage anyway. Then upload the jar to HDFS; I put mine at hdfs:///hive/hive_udf_jar/udf-1.0-SNAPSHOT-jar-with-dependencies.jar.

  With everything in place, open the Hive client and create a permanent UDF; in the Hive CLI:

hive> create function  rowyet.picratio as 'org.example.PicRatio_UDF' using jar 'hdfs:///hive/hive_udf_jar/udf-1.0-SNAPSHOT-jar-with-dependencies.jar';
Added [/opt/hive/log/c0017eab-0d8b-49bd-8157-09237071085c_resources/udf-1.0-SNAPSHOT-jar-with-dependencies.jar] to class path
Added resources: [hdfs:///hive/hive_udf_jar/udf-1.0-SNAPSHOT-jar-with-dependencies.jar]
OK
Time taken: 6.03 seconds

-- test
hive> select rowyet.picratio("4:3")
    > ;
OK
4:3
Time taken: 0.902 seconds, Fetched: 1 row(s)
hive> select rowyet.picratio("3:4");
OK
4:3
Time taken: 0.064 seconds, Fetched: 1 row(s)

  You can also register the function temporarily; the HiveQL is below. Note that a temporary function is session-scoped and its name cannot be qualified with a database. This form is less commonly used; permanent functions are the norm.

create temporary function picratio as 'org.example.PicRatio_UDF' using jar 'hdfs:///hive/hive_udf_jar/udf-1.0-SNAPSHOT-jar-with-dependencies.jar';

==Troubleshooting==

  1. If even loading the jar fails, i.e. the statement errors out before the "Added ... to class path" log lines below appear, for example with FAILED: SemanticException java.lang.IllegalArgumentException: java.net.UnknownHostException: hive, then the Hive client never loaded the jar; check that your jar path is correct and has no missing or extra /:
Added [/opt/hive/log/c0017eab-0d8b-49bd-8157-09237071085c_resources/udf-1.0-SNAPSHOT-jar-with-dependencies.jar] to class path
Added resources: [hdfs:///hive/hive_udf_jar/udf-1.0-SNAPSHOT-jar-with-dependencies.jar]
OK
  2. If the jar-loading log does appear but you get an error like Failed to register rowyet.picratio using class com.ruoyin.hiveudf.PicRatio_UDF, the jar itself was built incorrectly; go back to the packaging log, find the build errors, and fix them.

==Dropping a UDF==

  Dropping a user-defined UDF is also supported; the HiveQL is below.

hive> drop function rowyet.picratio;
Added [/opt/hive/log/ce6a5159-2063-4796-b4cb-5a6762b0ad15_resources/udf-1.0-SNAPSHOT-jar-with-dependencies.jar] to class path
Added resources: [hdfs:///hive/hive_udf_jar/udf-1.0-SNAPSHOT-jar-with-dependencies.jar]
OK
Time taken: 0.752 seconds
hive> select rowyet.picratio("4:3")
    > ;
FAILED: SemanticException [Error 10011]: Invalid function rowyet.picratio