Hive 使用<>'某个值' 时的坑

86 阅读1分钟

背景

统计一张表的用户人数

开始使用的SQL

select staticdate,
       count(distinct case when e_is_login_id = true then app_id end) `活跃用户`
from dws.dws_c_active_appstart
where staticdate = '2022-07-01'
  and e_utm_source <> 'miniWechat'
  and project in (1)
group by staticdate;

结果发现数据怎么都对不上,总是会少一点

问题原因

使用下边的SQL去看一下过滤的数据

select e_utm_source
from dws.dws_c_active_appstart
where staticdate = '2022-07-01'
  and e_utm_source <> 'miniWechat'
  and project in (1)
group by e_utm_source;
  • 结果发现:全部都是有效的值;
  • 但是e_utm_source字段肯定是存在空值和null值

使用下边的SQL验证一下

select e_utm_source
from dws.dws_c_active_appstart
where staticdate = '2022-07-01'
  and (e_utm_source <> 'miniWechat' or e_utm_source is null or e_utm_source = '')
  and project in (1)
group by e_utm_source;
  • 结果果然存在null值

结论

  • hive使用<> 过滤筛选时,会默认为需要不为空的值,自动过滤,可采用第二种方式结局