ClickHouse的GLOBAL RIGHT JOIN大坑分析案例分享
转载自微信公众号:mp.weixin.qq.com/s/ro5qaM8sl…
1.背景
最近收到反馈在我们数据平台--洞察的统计/看板的一些表中偶尔会出现内存不足的报错,如下:
类似的,在分层管理/任务管理中也有部分任务会因为内存不足而失败.
2.分析
首先查看该表/任务的sql, 经排查发现是由于主sql中,有一段子查询的内存占用特别高,在内存紧张的情况下往往会失败,该子查询如下:
with voided as (
select if(t.region_t is not null, t.region_t , 'all') as region_tmp,
if( t.countrycode_t is not null, t.countrycode_t , 'all') as countrycode_tmp,
if( t.app_id is not null , toString( t.app_id) , 'all') as app_id,
count(distinct user_id) as voided_user_cnt,
sum(amount) as voided_amount,
if(t.dt is not null,toString(t.dt),'all') as dt_tmp,
if(t.pay_method is not null,toString(t.pay_method),'all') as pay_method_tmp
from(select toNullable(if(mapping.region is not null, mapping.region, '其它')) as region_t,
toNullable(if(account.CountryCode is not null, account.CountryCode, '其它')) as countrycode_t,
toNullable(case when goods.device_type = '2' then 10018
when goods.device_type = '3' then 10019
when goods.device_type = '4' then 0
else 2
end) as app_id,
voided.user_id as user_id,
voided.amount as amount
,toNullable(voided.dt) as dt
,toNullable(voided.paymethod) as pay_method
from (select user_id,order_id,amount,toDate(create_at) as dt,paymethod
from sg_chikii.sv_usr_voidedOrder_distributed
WHERE create_at BETWEEN toDateTime(toStartOfMonth(toDateTime('$begin_time$'))) AND date_sub(date_add(toDateTime(toStartOfMonth(addMonths(toDateTime('$begin_time$'), 1))-1), interval 1 day ), interval 1 second )
) voided
global left join sg_chikii.sv_usr_account_distributed account on voided.user_id = toString(account.Id)
global left join sg_chikii.sv_usr_dimCountryMappingTbl_distributed mapping on account.CountryCode = mapping.code
global left join sg_chikii.sv_usr_goodsOrder_distributed goods on voided.order_id = goods.trade_no
) t
group by t.region_t, t.countrycode_t, t.app_id,t.dt,t.pay_method
with cube
having dt_tmp != 'all'
)
内存占用如下:
一共占用了10G,我们当前的ck一共也就16G,截止到当前的ck配置上的内存限制为14G,怪不得会执行失败
3.解决方案
现在知道出问题的sql,那么就有两个解决方案了:
第一个很简单,优化不行就生配置呗
第二个就是优化sql了。接下来就重点讲第二个方案。
既然是内存占用过高,那么就肯定是在内存中保存的数据太大了,看一下出问题的sql设计到的数据量:
表名/条数
sv_usr_voidedOrder_distributed : 288 sv_usr_dimCountryMappingTbl_distributed : 265 sv_usr_account_distributed : 53912447 sv_usr_goodsOrder_distributed : 61966255
可以看到account,goodsOrder表都是大表,难道ck把几个表都全量加载进内存了吗?先来看一下经过sql优化器优化过后的sql(使用ck的explain syntax):
SELECT
if(isNotNull(region_t), region_t, 'all') AS region_tmp,
if(isNotNull(countrycode_t), countrycode_t, 'all') AS countrycode_tmp,
if(isNotNull(app_id), toString(app_id), 'all') AS app_id,
uniqExact(user_id) AS voided_user_cnt,
sum(amount) AS voided_amount,
if(isNotNull(pay_method), toString(pay_method), 'all') AS pay_method_tmp
FROM
(
SELECT
toNullable(if(isNotNull(region), region, '其它')) AS region_t,
toNullable(if(isNotNull(CountryCode), CountryCode, '其它')) AS countrycode_t,
toNullable(multiIf(device_type = '2', 10018, device_type = '3', 10019, device_type = '4', 0, 2)) AS app_id,
`--voided.user_id` AS user_id,
`--voided.amount` AS amount,
toNullable(`--voided.paymethod`) AS pay_method
FROM
(
SELECT
`--voided.order_id`,
`--voided.paymethod`,
`--voided.dt`,
`--voided.amount`,
`--voided.user_id`,
Id,
CountryCode,
code,
region
FROM
(
SELECT
order_id AS `--voided.order_id`,
paymethod AS `--voided.paymethod`,
dt AS `--voided.dt`,
amount AS `--voided.amount`,
user_id AS `--voided.user_id`,
Id,
CountryCode
FROM
(
SELECT
user_id,
order_id,
amount,
toDate(create_at) AS dt,
paymethod
FROM sg_chikii.sv_usr_voidedOrder_distributed
WHERE (create_at >= toDateTime('2024-11-04')) AND (create_at <= toDateTime('2024-11-04'))
) AS voided
GLOBAL ALL LEFT JOIN sg_chikii.sv_usr_account_distributed AS account ON `--voided.user_id` = toString(Id)
) AS `--.s`
GLOBAL ALL LEFT JOIN sg_chikii.sv_usr_dimCountryMappingTbl_distributed AS mapping ON CountryCode = code
) AS `--.s`
GLOBAL ALL LEFT JOIN sg_chikii.sv_usr_goodsOrder_distributed AS goods ON `--voided.order_id` = trade_no
) AS t
GROUP BY
region_t,
countrycode_t,
app_id,
pay_method
WITH CUBE
HAVING dt_tmp != 'all'
明显的,我们的源sql被优化成了多个主查询-子查询的嵌套,同时还做了3个关键操作 global left join。
先来分析一下global关键字,在此贴上gpt的解释:
再来看一下上面的子查询有哪些, 对于第一个主查询-子查询的嵌套:
第一个:sg_chikii.sv_usr_goodsOrder_distributed 第二个:sg_chikii.sv_usr_dimCountryMappingTbl_distributed 第三个:sg_chikii.sv_usr_account_distributed
是的,对于上面被sql优化器优化过后的sql来说,子查询就是这3个表的全量数据,这情况内存不高才有鬼了。
既然知道原因了,那么我们调换一下主查询和子查询不就行了?直接把A global left join B 变成 B global right join A。来试一下(由于sv_usr_dimCountryMappingTbl_distributed数据量比较小就不转换这个了):
select if(region_t1 is not NULL,toString(region_t1), 'all') as region,
if(countrycode_t1 is not null,toString(countrycode_t1), 'all') as countrycode,
if(app_id_t1 is not NULL, toString(app_id_t1),'all') as app_id,
count(distinct user_id_t1) as voided_user_cnt,
sum(amount_t1) as voided_amount,
if(t.pay_method_t1 is not null,toString(t.pay_method_t1),'all') as pay_method_tmp
from(
select toNullable(if(mapping.region is not null, mapping.region, '其它')) as region_t1,
toNullable(tmp_1.country_code_t) as countrycode_t1,
toNullable(case when device_type_t = '2' then 10018
when device_type_t = '3' then 10019
when device_type_t = '4' then 0
else 2
end) as app_id_t1,
user_id_t as user_id_t1,
amount_t as amount_t1,
toNullable(pay_method_t) as pay_method_t1
from (
select DISTINCT
toNullable(voided_t.country_code) as country_code_t,
toNullable(voided_t.paymethod) as pay_method_t,
goods.device_type as device_type_t,
if(voided_t.paymethod = 9, voided_t.amount / 120 * 100, toFloat64(voided_t.amount)) as amount_t,
toNullable(voided_t.dt) as dt_t,
goods.trade_no as trade_no_t,
voided_t.order_id as order_id_t,
voided_t.user_id as user_id_t
from sg_chikii.sv_usr_goodsOrder_distributed goods
global right join
(
select
voided.user_id as user_id,
voided.order_id as order_id,
voided.amount as amount,
voided.paymethod as paymethod,
if(account.CountryCode = '' or account.CountryCode is null, 'other', account.CountryCode) as country_code,
voided.dt as dt
from sg_chikii.sv_usr_account_distributed account
global right join (
select user_id,order_id,amount,paymethod,toDate(create_at) as dt
from sg_chikii.sv_usr_voidedOrder_distributed
WHERE create_at between '2024-11-04' and '2024-11-04'
) voided on voided.user_id = toString(account.Id)
) voided_t on voided_t.order_id = goods.trade_no
)tmp_1
GLOBAL LEFT JOIN sg_chikii.sv_usr_dimCountryMappingTbl_distributed mapping on tmp_1.country_code_t = mapping.code
) t
group by t.region_t1, t.countrycode_t1, t.app_id_t1,t.pay_method_t1
with cube
内存占用:
33453606 / 1024 / 1024 / 1024 = 0.031G,直接给他干到了0.031G
接下来只要数据正确就完美了: 原数据:
新sql的数据:
可以看到数据对不上了,后续查明是 global 加 right join 的问题,这个等下再讲,先贴一下在不能使用global right join 的情况下的解决方案:
t1 left join t2的语义返回左表(t1)中的所有记录以及右表(t2)中满足查询条件的记录,如果右表中没有满足条件的记录,则置null
所以
select t1.id as id1
,t2.id as id2
from t1 left join t2 on t1.id = t2.id
可以改写成:
select id1, id2
from (
select t1.id as id1
,t2.id as id2
from t1 inner join t2 on t1.id = t2.id
union all
select t1.id as id1
,null as id2
from t1
where t1.id not in (
select id
from t2
where t2.id in (
select t1.id
from t1
)
)
)
最终编写出的sql如下(数据已验证正确):
with voided as (
select user_id,order_id,amount,paymethod,toDate(create_at) as dt
from sg_chikii.sv_usr_voidedOrder_distributed
WHERE create_at between '$begin_time$' and '$end_time$'
),voided_tmp as (
select
v1.user_id as user_id,
v1.order_id as order_id,
v1.amount as amount,
v1.paymethod as paymethod,
if(account.CountryCode = '' or account.CountryCode is null, 'other', account.CountryCode) as country_code
from sg_chikii.sv_usr_account_distributed account
global inner join voided v1 on v1.user_id = toString(account.Id)
union all
select
v1.user_id as user_id,
v1.order_id as order_id,
v1.amount as amount,
v1.paymethod as paymethod,
null
from voided v1
where v1.user_id global not in (
select toString(account.Id) as id
from sg_chikii.sv_usr_account_distributed account
where toString(account.Id) global in (
select user_id from voided
)
)
),tot_t as (
select toNullable(if(mapping.region is not null, mapping.region, '其它')) as region_t1,
toNullable(tt.country_code_t) as countrycode_t1,
toNullable(case when device_type_t = '2' then 10018
when device_type_t = '3' then 10019
when device_type_t = '4' then 0
else 2
end) as app_id_t1,
user_id_t as user_id_t1,
amount_t as amount_t1,
toNullable(pay_method_t) as pay_method_t1
from (
select DISTINCT
toNullable(voided_t.country_code) as country_code_t,
toNullable(voided_t.paymethod) as pay_method_t,
goods.device_type as device_type_t,
if(voided_t.paymethod = 9, voided_t.amount / 120 * 100, toFloat64(voided_t.amount)) as amount_t,
goods.trade_no as trade_no_t,
voided_t.order_id as order_id_t,
voided_t.user_id as user_id_t
from sg_chikii.sv_usr_goodsOrder_distributed goods
global inner join voided_tmp voided_t on voided_t.order_id = goods.trade_no
union all
select
toNullable(voided_t.country_code) as country_code_t,
toNullable(voided_t.paymethod) as pay_method_t,
null as device_type_t,
if(voided_t.paymethod = 9, voided_t.amount / 120 * 100, toFloat64(voided_t.amount)) as amount_t,
null as trade_no_t,
voided_t.order_id as order_id_t,
voided_t.user_id as user_id_t
from voided_tmp voided_t
where voided_t.order_id global not in (
select trade_no
from sg_chikii.sv_usr_goodsOrder_distributed
where trade_no global in (
select order_id from voided_tmp
)
)
) tt
GLOBAL LEFT JOIN sg_chikii.sv_usr_dimCountryMappingTbl_distributed mapping on tt.country_code_t = mapping.code
)
select if(region_t1 is not NULL,toString(region_t1), 'all') as region,
if(countrycode_t1 is not null,toString(countrycode_t1), 'all') as countrycode,
if(app_id_t1 is not NULL, toString(app_id_t1),'all') as app_id,
count(distinct user_id_t1) as voided_user_cnt,
sum(amount_t1) as voided_amount,
if(pay_method_t1 is not null,toString(pay_method_t1),'all') as pay_method_tmp
from tot_t t
group by t.region_t1, t.countrycode_t1, t.app_id_t1,t.pay_method_t1
with cube
order by region, countrycode, app_id, pay_method_tmp
占用内存:
512429025 / 1024 / 1024 /1024 = 0.477G,大概0.5g,完全可以接受了
4. t1 global left join t2 != t2 global right join t1
现在再来说说为什么不能使用global right join。(其实如果ck集群只有一个节点的话是完全没有问题的)
前面说到GLOBAL关键字会让ck执行以下步骤: 解析查询->执行子查询->广播结果→结果聚合
问题就出在“执行子查询”和“结果聚合”这两步上。
首先我们都知道在使用分布式方案上的ck是可以分成两种表的:本地表,分布表。本地表只有当前节点的数据,分布表有全部节点的数据。
使用global关键字之后的“执行子查询”这一步的主查询是使用本地表的,现在假设有A,B两个节点,那么当在A节点执行字查询的时候,t2中与t1匹配的记录是在B节点上的,这时候就会返回 t2.null,t1.id这种记录,但是在整体上看真正的数据应该是t2.id,t1.id这样的。
下面是一个例子:
A节点有t1表的{id:1}记录,有t2表的{id=2}记录
B节点有t1表的{id:2}记录,有t2表的{id=1}记录
现在执行
select t1.id as id1
,t2.id as id2
from t2 global right join t1 on t1.id = t2.id
如果不是分布式数据的话,正确应该得到以下数据
{id1= 1,id2 = 1},
{id1=2,id2=2}
那么当ck是分布式的情况呢?
A节点: 由于A节点的主查询既t2表只有{id=2}的记录,所以会得到
{id1=1,id2 = null},
{id1=2,id2 = 2}
B节点:同理,得到:
{id1=1,id2 = 1},
{id1=2,id2 = null}
现在再执行到“结果聚合”的步骤,就会得到4条记录
{id1=1,id2 = null},
{id1=2,id2 = 2},
{id1=1,id2 = 1},
{id1=2,id2 = null}
所以,经过上述例子的分析就明白为什么 t1 global left join t2 != t2 global right join t1 了。