Hive 递归查询树状结构 递归遍历树

1,827 阅读2分钟

需求

hive中有一张表存储所有省市地区信息或者公司组织信息, 查询所有最末枝的地区/组织,如下所示, 查询结果应该是沙河镇, 马池口镇, 中关村,上地,济南市

北京市
   昌平区
       沙河镇
       马池口镇
    海淀区
        中关村
        上地
山东省
    济南市
......

创建hive模拟表

create table area (id int, name string, parent_id int);

插入数据

with area_info as
 (select 1 as id, '北京市'     as name, 0 as parent_id union all
  select 2 as id, '山东省'     as name, 0 as parent_id union all
  select 3 as id, '昌平区'     as name, 1 as parent_id union all
  select 4 as id, '海淀区'     as name, 1 as parent_id union all
  select 5 as id, '沙河镇'     as name, 3 as parent_id union all
  select 6 as id, '马池口镇'    as name, 3 as parent_id union all
  select 7 as id, '中关村'     as name, 4 as parent_id union all
  select 8 as id, '上地'       as name, 4 as parent_id union all
  select 9 as id, '烟台市'     as name, 2 as parent_id union all
  select 10 as id, '即墨区'    as name, 9 as parent_id union all
  select 11 as id, '牟平区'    as name, 9 as parent_id union all
  select 12 as id, '济南市'    as name, 2 as parent_id
)
insert overwrite table area select * from area_info;

重点是SQL

通过Hive的CTE实现组织树的遍历

  1. 假如有N层架构, 就要写N-1个CTE子句
  2. 需要注意保留没有子节点的父节点
with p1 as (select t1.* from area t1 --子
join area t2 --父
on t1.parent_id = t2.id 
where t2.name in ('北京市', '山东省')),
p2 as (
--所有子节点
select t3.* from area t3 --子
 join p1 --父
on t3.parent_id = p1.id
union all
--没有子节点的父节点
select p1.* from area t3 --子
 right join p1 --父
on t3.parent_id = p1.id
where t3.id is null
)
select * from p2 ;

结果

p2.idp2.namep2.parent_id
7中关村4
8上地4
5沙河镇3
6马池口镇3
10即墨区9
11牟平区9
12济南市2

以上, 如有更好的写法, 请评论, 大家一起交流。