Hive元数据表结构分析

3,059 阅读5分钟

Hive元数据表依赖关系

  • Hive数据库相关的元数据表(DBS、DATABASE_PARAMS)

  • Hive表和视图相关的元数据表(TBLS、TABLE_PARAMS、TBL_PRIVS通过TBL_ID关联)

  • Hive文件存储信息相关的元数据表(SDS、SD_PARAMS、SERDES、SERDE_PARAMS)

  • Hive表字段相关的元数据表(COLUMNS_V2、TBLS、TAB_COL_STATS)

  • Hive表分区相关的元数据表(PARTITIONS、PARTITION_KEYS、PARTITION_KEY_VALS、PARTITION_PARAMS)

  • HiveUDF函数相关表(FUNCS、FUNC_RU)

Hive元数据表职能分析(关键字段)

  • Hive版本:hive-1.1.0-cdh5.16.2
  • 元表数量:35张
  • Hive-2.X、3.X新增的元表(待续。。。)

VERSION

mysql> select * from VERSION;
+--------+----------------+-------------------+---------------------------------------+
| VER_ID | SCHEMA_VERSION | SCHEMA_VERSION_V2 | VERSION_COMMENT                       |
+--------+----------------+-------------------+---------------------------------------+
|      1 | 1.2.0          | NULL              | Set by MetaStore hadoop@172.18.12.249 |
+--------+----------------+-------------------+---------------------------------------+
1 row in set (0.09 sec)
  • 如果该表不存在,启动Hive客户端时,会提示Table ‘hive.version’ doesn’t exist

  • 该表如果存在多条数据,则会提示FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

DBS(数据库信息)

mysql> select * from DBS limit 5;
+-------+-----------------------+-----------------------------------------------------------+---------+------------+------------
| DB_ID | DESC                  | DB_LOCATION_URI                                           | NAME    | OWNER_NAME | OWNER_TYPE 
+-------+-----------------------+-----------------------------------------------------------+---------+------------+------------+
|     1 | Default Hive database | hdfs://xinxingdata001:9000/user/hive/warehouse            | default | public     | ROLE       |
|     6 | NULL                  | hdfs://xinxingdata001:9000/user/hive/warehouse/xinxing.db | xinxing | hadoop     | USER       |
+-------+-----------------------+-----------------------------------------------------------+---------+------------+------------+
2 rows in set (0.08 sec)
  • DB_ID:Hive数据库唯一ID
  • DESC:注释
  • DB_LOCATION_URI:数据库数据存储路径
  • NAME:数据库名
  • OWNER_NAME:数据库所有者用户名
  • OWNER_TYPE:所有者角色

TBLS(表信息)

mysql> select * from TBLS limit 5;
+--------+-------------+-------+------------------+--------+------------+-----------+-------+------------------+----------------+--------------------+--------------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER  | OWNER_TYPE | RETENTION | SD_ID | TBL_NAME         | TBL_TYPE       | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT |
+--------+-------------+-------+------------------+--------+------------+-----------+-------+------------------+----------------+--------------------+--------------------+
|     22 |  1591951922 |     6 |                0 | hadoop | USER       |         0 |    22 | hive_student     | MANAGED_TABLE  | NULL               | NULL               |
|     31 |  1591952505 |     6 |                0 | hadoop | USER       |         0 |    41 | hive_ods_student | MANAGED_TABLE  | NULL               | NULL               |
|     46 |  1596364637 |     1 |                0 | hadoop | USER       |         0 |    56 | request_log      | EXTERNAL_TABLE | NULL               | NULL               |
|     51 |  1596783742 |     1 |                0 | hadoop | USER       |         0 |    61 | create_jd        | MANAGED_TABLE  | NULL               | NULL               |
|     52 |  1596783742 |     1 |                0 | hadoop | USER       |         0 |    62 | picked_jd        | MANAGED_TABLE  | NULL               | NULL               |
+--------+-------------+-------+------------------+--------+------------+-----------+-------+------------------+----------------+--------------------+--------------------+
5 rows in set (0.08 sec)
  • TBL_ID:Hive表唯一ID
  • CREATE_TIME:表创建日期(时间戳格式)
  • DB_ID:Hive数据库唯一ID
  • OWNER:表所有者用户名
  • OWNER_TYPE:所有者角色
  • RETENTION:保留字段
  • SD_ID:序列化配置信息
  • TBL_NAME:表名
  • TBL_TYPE:表类型
  • VIEW_EXPANDED_TEXT:视图的详细HQL语句
  • VIEW_ORIGINAL_TEXT:视图的原始HQL语句

TABLE_PARAMS(存储表/视图的属性信息)

mysql> select * from TABLE_PARAMS limit 10;
+--------+-----------------------+------------------------------------------+
| TBL_ID | PARAM_KEY             | PARAM_VALUE                              |
+--------+-----------------------+------------------------------------------+
|     22 | COLUMN_STATS_ACCURATE | true                                     |
|     22 | comment               | ???????                                  |
|     22 | numFiles              | 1                                        |
|     22 | numRows               | 10                                       |
|     22 | rawDataSize           | 273                                      |
|     22 | totalSize             | 283                                      |
|     22 | transient_lastDdlTime | 1592383477                               |
|     31 | COLUMN_STATS_ACCURATE | true                                     |
|     31 | comment               | Imported by sqoop on 2020/06/12 17:01:39 |
|     31 | numFiles              | 4                                        |
+--------+-----------------------+------------------------------------------+
10 rows in set (0.10 sec)
  • TBL_ID:Hive表唯一ID
  • PARAM_KEY:属性名
  • PARAM_VALUE:属性名对应的值

SERDES(序列化使用的类信息)

mysql> select * from SERDES limit 5;
+----------+------+----------------------------------------------------+
| SERDE_ID | NAME | SLIB                                               |
+----------+------+----------------------------------------------------+
|       22 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|       41 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|       56 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|       61 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|       62 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
+----------+------+----------------------------------------------------+
5 rows in set (0.10 sec)
  • SERDE_ID:序列化类ID
  • NAME:别名
  • SLIB:序列化类

SERDE_PARAMS(存储序列化的一些属性、格式信息)

mysql> select * from SERDE_PARAMS limit 10;
+----------+----------------------+-------------+
| SERDE_ID | PARAM_KEY            | PARAM_VALUE |
+----------+----------------------+-------------+
|       22 | field.delim          | ,           |
|       22 | serialization.format | ,           |
|       41 | field.delim          | ,           |
|       41 | line.delim           |				|
|       41 | serialization.format | ,           |
|       56 | field.delim          | ,           |
|       56 | serialization.format | ,           |
|       61 | field.delim          | ,           |
|       61 | serialization.format | ,           |
|       62 | field.delim          | ,           |
|       73 | serialization.format | 1           |
+----------+----------------------+-------------+
10 rows in set (0.11 sec)
  • SERDE_ID:序列化类配置ID
  • PARAM_KEY:属性名
  • PARAM_VALUE:属性值

SEQUENCE_TABLE(对于DB、TBL等的SEQUENCE_id ,每次新增则+1)

mysql> select * from SEQUENCE_TABLE limit 10;
+-----------------------------------------------------------+----------+
| SEQUENCE_NAME                                             | NEXT_VAL |
+-----------------------------------------------------------+----------+
| org.apache.hadoop.hive.metastore.model.MColumnDescriptor  |       96 |
| org.apache.hadoop.hive.metastore.model.MDatabase          |       11 |
| org.apache.hadoop.hive.metastore.model.MGlobalPrivilege   |        6 |
| org.apache.hadoop.hive.metastore.model.MPartition         |       31 |
| org.apache.hadoop.hive.metastore.model.MRole              |        6 |
| org.apache.hadoop.hive.metastore.model.MSerDeInfo         |      111 |
| org.apache.hadoop.hive.metastore.model.MStorageDescriptor |      111 |
| org.apache.hadoop.hive.metastore.model.MTable             |       96 |
| org.apache.hadoop.hive.metastore.model.MVersionTable      |        6 |
+-----------------------------------------------------------+----------+
9 rows in set (0.10 sec)
  • SEQUENCE_NAME:(例.MDatabase=11,则为该创建过11个数据库,MTable则为Hive总共创建过多少张表)
  • NEXT_VAL:value值

SDS(保存文件存储的基本信息)

提供table/partition对应的文件系统路径location,以及对这个数据读取的InputFormat、是否压缩、是否是子文件夹存储、SerDe类(对应于SERDES表)

  • SD_ID:TBLS表数据与该字段关联,则可知道,该表存储的基本信息。
  • CD_ID:字段信息ID,对应CDS表
  • INPUT_FORMAT:文件输入格式
  • IS_COMPRESSED:是否压缩
  • IS_STOREDASSUBDIRECTORIES:是否以子目录存储
  • LOCATION:表数据存储路径
  • NUM_BUCKETS:分桶数量,不分则为-1
  • OUTPUT_FORMAT:文件输出格式
  • SERDE_ID:对应SERDES

ROLES(Hive权限表)


mysql> select * from ROLES;
+---------+-------------+------------+-----------+
| ROLE_ID | CREATE_TIME | OWNER_NAME | ROLE_NAME |
+---------+-------------+------------+-----------+
|       1 |  1591862572 | admin      | admin     |
|       2 |  1591862572 | public     | public    |
+---------+-------------+------------+-----------+
2 rows in set (0.13 sec)
  • ROLE_ID:权限ID
  • CREATE_TIME:创建时间
  • OWNER_NAME:用户名
  • ROLE_NAME:用户权限

PARTITIONS(表分区信息)

mysql> select * from PARTITIONS;
+---------+-------------+------------------+---------------+-------+--------+
| PART_ID | CREATE_TIME | LAST_ACCESS_TIME | PART_NAME     | SD_ID | TBL_ID |
+---------+-------------+------------------+---------------+-------+--------+
|      27 |  1597489789 |                0 | time=20200814 |   102 |     90 |
+---------+-------------+------------------+---------------+-------+--------+
1 row in set (0.10 sec)
  • PART_ID:分区ID
  • CREATE_TIME:创建时间
  • LAST_ACCESS_TIME:最后一次访问时间
  • PART_NAME:分区规则
  • SD_ID:???
  • TBL_ID:表ID

PARTITION_PARAMS(分区的属性信息)

mysql> select * from PARTITION_PARAMS;
+---------+-----------------------+-------------+
| PART_ID | PARAM_KEY             | PARAM_VALUE |
+---------+-----------------------+-------------+
|      27 | COLUMN_STATS_ACCURATE | true        |
|      27 | numFiles              | 2           |
|      27 | numRows               | 0           |
|      27 | rawDataSize           | 0           |
|      27 | totalSize             | 14655       |
|      27 | transient_lastDdlTime | 1598963992  |
+---------+-----------------------+-------------+
6 rows in set (0.10 sec)
  • PART_ID:分区ID
  • PARAM_KEY:属性名
  • PARAM_VALUE:属性值

PARTITION_KEYS(表分区的字段信息)

mysql> select * from PARTITION_KEYS;
+--------+--------------+----------------+-----------+-------------+
| TBL_ID | PKEY_COMMENT | PKEY_NAME      | PKEY_TYPE | INTEGER_IDX |
+--------+--------------+----------------+-----------+-------------+
|     46 | NULL         | partition_date | string    |           0 |
|     90 | NULL         | time           | string    |           0 |
+--------+--------------+----------------+-----------+-------------+
2 rows in set (0.10 sec)
  • TBL_ID:表ID
  • PKEY_COMMENT:分区字段说明
  • PKEY_NAME:表的分区字段名
  • PKEY_TYPE:表的分区字段类型
  • INTEGER_IDX:表的分区字段顺序,一张表可能会设置多个分区

PARTITION_KEY_VALS(存储分区字段值)

mysql> select * from PARTITION_KEY_VALS;
+---------+--------------+-------------+
| PART_ID | PART_KEY_VAL | INTEGER_IDX |
+---------+--------------+-------------+
|      27 | 20200814     |           0 |
+---------+--------------+-------------+
1 row in set (0.09 sec)
  • PART_ID:分区ID
  • PART_KEY_VAL:分区字段值
  • INTEGER_IDX:分区字段顺序

GLOBAL_PRIVS(权限表)

mysql> select * from GLOBAL_PRIVS;
+---------------+-------------+--------------+---------+--------------+----------------+----------------+-----------+
| USER_GRANT_ID | CREATE_TIME | GRANT_OPTION | GRANTOR | GRANTOR_TYPE | PRINCIPAL_NAME | PRINCIPAL_TYPE | USER_PRIV |
+---------------+-------------+--------------+---------+--------------+----------------+----------------+-----------+
|             1 |  1591862572 |            1 | admin   | ROLE         | admin          | ROLE           | All       |
+---------------+-------------+--------------+---------+--------------+----------------+----------------+-----------+
1 row in set (0.10 sec)

FUNCS(注册的UDF函数信息)

mysql> select * from FUNCS;
+---------+------------------------------------+-------------+-------+-----------+-----------+------------+------------+
| FUNC_ID | CLASS_NAME                         | CREATE_TIME | DB_ID | FUNC_NAME | FUNC_TYPE | OWNER_NAME | OWNER_TYPE |
+---------+------------------------------------+-------------+-------+-----------+-----------+------------+------------+
|       1 | xinxingdata.bigdata.UDF.AreaSplice |  1600421142 |     1 | geosplit  |         1 | NULL       | USER       |
+---------+------------------------------------+-------------+-------+-----------+-----------+------------+------------+
1 row in set (0.09 sec)
  • FUNC_ID:UDF_ID
  • CLASS_NAME:UDF函数jar包的主方法路径
  • CREATE_TIME:创建时间
  • DB_ID:数据库ID
  • FUNC_NAME:方法名
  • FUNC_TYPE:?
  • OWNER_NAME:用户名
  • OWNER_TYPE:角色

COLUMNS_V2(表对应的字段信息)

mysql> select * from COLUMNS_V2 limit 10;
+-------+---------+-------------+-----------+-------------+
| CD_ID | COMMENT | COLUMN_NAME | TYPE_NAME | INTEGER_IDX |
+-------+---------+-------------+-----------+-------------+
|    22 | NULL    | age         | int       |           2 |
|    22 | NULL    | create_date | string    |           4 |
|    22 | NULL    | id          | int       |           0 |
|    22 | NULL    | name        | string    |           1 |
|    22 | NULL    | sex         | string    |           3 |
|    31 | NULL    | age         | int       |           2 |
|    31 | NULL    | create_date | string    |           4 |
|    31 | NULL    | id          | int       |           0 |
|    31 | NULL    | name        | string    |           1 |
|    31 | NULL    | sex         | string    |           3 |
+-------+---------+-------------+-----------+-------------+
10 rows in set (0.10 sec)
  • CD_ID:对应TBLS表中TBL_ID
  • COMMENT:注释
  • COLUMN_NAME:表的字段名
  • TYPE_NAME:表的字段类型
  • INTEGER_IDX:字段顺序

CDS

mysql> select * from CDS limit 5;
+-------+
| CD_ID |
+-------+
|    22 |
|    31 |
|    46 |
|    51 |
|    52 |
+-------+
10 rows in set (0.10 sec)

TAB_COL_STATS

CREATE TABLE `TAB_COL_STATS` (
  `CS_ID` bigint(20) NOT NULL,
  `DB_NAME` varchar(128) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
  `TABLE_NAME` varchar(256) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
  `COLUMN_NAME` varchar(767) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
  `COLUMN_TYPE` varchar(128) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
  `TBL_ID` bigint(20) NOT NULL,
  `LONG_LOW_VALUE` bigint(20) DEFAULT NULL,
  `LONG_HIGH_VALUE` bigint(20) DEFAULT NULL,
  `DOUBLE_HIGH_VALUE` double(53,4) DEFAULT NULL,
  `DOUBLE_LOW_VALUE` double(53,4) DEFAULT NULL,
  `BIG_DECIMAL_LOW_VALUE` varchar(4000) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL,
  `BIG_DECIMAL_HIGH_VALUE` varchar(4000) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL,
  `NUM_NULLS` bigint(20) NOT NULL,
  `NUM_DISTINCTS` bigint(20) DEFAULT NULL,
  `AVG_COL_LEN` double(53,4) DEFAULT NULL,
  `MAX_COL_LEN` bigint(20) DEFAULT NULL,
  `NUM_TRUES` bigint(20) DEFAULT NULL,
  `NUM_FALSES` bigint(20) DEFAULT NULL,
  `LAST_ANALYZED` bigint(20) NOT NULL,
  PRIMARY KEY (`CS_ID`),
  KEY `TAB_COL_STATS_FK` (`TBL_ID`),
  KEY `TAB_COL_STATS_IDX` (`DB_NAME`,`TABLE_NAME`,`COLUMN_NAME`) USING BTREE,
  CONSTRAINT `TAB_COL_STATS_FK` FOREIGN KEY (`TBL_ID`) REFERENCES `TBLS` (`TBL_ID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
  • NUM_NULLS:自动空值数量
  • NUM_DISTINCTS:应该是去重后的个数、待验证
  • AVG_COL_LEN:字段value的平均长度
  • MAX_COL_LEN:字段value的最大长度