Original post: www.gbase.cn/community/p…
More content is available in the 南大通用 GBase technical community; 南大通用 is committed to becoming the database product vendor that users trust most.
This case demonstrates how to cut resource consumption and improve SQL performance by changing a table's distribution key.
Example SQL:
create table test(col int,id int,name text)
distribute by hash (col);
create table test_1(col int,id int,name varchar(64))
distribute by hash(name);
insert into test select 1,generate_series(1,100000), md5(random()::text);
insert into test select 64,generate_series(1,100000), md5(random()::text);
insert into test_1 select generate_series(1,100000),generate_series(1,100000), md5(random()::text);
The execution plan is as follows:
explain analyze select * from test a join test_1 b on a.col=b.id ;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Streaming(type: GATHER)  (cost=13.29..29.25 rows=10 width=194) (actual time=106.033..1529.207 rows=200000 loops=1)
   Spawn on: All datanodes
   ->  Hash Join  (cost=13.29..28.64 rows=20 width=194) (Actual time: never executed)
         Hash Cond: (a.col = b.id)
         ->  Streaming(type: BROADCAST)  (cost=0.00..15.18 rows=40 width=40) (Actual time: never executed)
               Spawn on: All datanodes
               ->  Seq Scan on test a  (cost=0.00..13.13 rows=20 width=40) (Actual time: never executed)
         ->  Hash  (cost=13.13..13.13 rows=21 width=154) (Actual time: never executed)
               Buckets: 0  Batches: 0  Memory Usage: 0kB
               ->  Seq Scan on test_1 b  (cost=0.00..13.13 rows=20 width=154) (Actual time: never executed)
 Total runtime: 1562.160 ms
Table test is distributed on the col column and test_1 on the name column, while the join condition is a.col = b.id. The planner therefore has to use stream operators: table test is broadcast to all datanodes (every DN node), so each DN node holds a full copy of test and can run the a.col = b.id join against its local slice of test_1. Once every DN node has finished its join, the results are sent up to the CN through Streaming (GATHER). In other words, during this execution the DN nodes exchange and pull data from one another, which adds network overhead, and that is the point we can optimize.
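As a side check, you can look at how the rows of each table actually land on the DN nodes. The query below is only a sketch: it assumes the Postgres-XC-style hidden system column xc_node_id is exposed by your GBase version; if it is not, use whatever per-node row-count view your release provides.
-- Sketch (assumes the xc_node_id system column is available):
-- count rows per DN node for both tables. Since test is distributed by
-- hash(col) and col only takes the values 1 and 64, its rows sit on at
-- most two DN nodes, while test_1 (hash(name) over md5 values) spreads evenly.
select xc_node_id, count(*) from test group by xc_node_id;
select xc_node_id, count(*) from test_1 group by xc_node_id;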
Optimization: remove the network overhead of DN nodes pulling data from each other. Based on the join condition, set the distribution key of test_1 to the id column.
create table test(col int,id int,name text)
distribute by hash (col);
create table test_1(col int,id int,name varchar(64))
distribute by hash(id);
insert into test select 1,generate_series(1,100000), md5(random()::text);
insert into test select 64,generate_series(1,100000), md5(random()::text);
insert into test_1 select generate_series(1,100000),generate_series(1,100000), md5(random()::text);
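To confirm which column each table is now distributed on, the cluster catalog can be queried. This is a sketch that assumes a Postgres-XC-style pgxc_class catalog (pclocatortype 'H' means hash distribution); the exact catalog and column names may differ between GBase versions.
-- Sketch (assumes the pgxc_class catalog exists):
-- show the locator type and distribution-column attribute number for both tables.
select pcrelid::regclass as table_name, pclocatortype, pcattnum
from pgxc_class
where pcrelid in ('test'::regclass, 'test_1'::regclass);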
The execution plan is as follows:
postgres=# explain analyze select * from test a join test_1 b on a.col=b.id;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Data Node Scan  (cost=0.00..0.00 rows=1000 width=194) (actual time=36.912..894.167 rows=200000 loops=1)
   Node/s: All datanodes
   Remote query: SELECT a.col, a.id, a.name, b.col, b.id, b.name FROM public.test a JOIN public.test_1 b ON a.col = b.id
   ->  Hash Join  (cost=805.59..238675.59 rows=20775000 width=195)
         Hash Cond: (a.col = b.id)
         ->  Seq Scan on test a  (cost=0.00..3870.00 rows=200000 width=41)
         ->  Hash  (cost=675.75..675.75 rows=20775 width=154)
               ->  Seq Scan on test_1 b  (cost=0.00..675.75 rows=20775 width=154)
 Total runtime: 955.638 ms
(9 rows)
The execution time drops from 1562.160 ms to 955.638 ms. The plan also shows that the network overhead of DN nodes pulling data from each other is gone: each DN node runs the join purely on its own local data and then returns the result to the upper-level CN.
Summary: in this case the distribution key was adjusted to match the join condition, which removes the network overhead of DN nodes pulling data from each other and allows the query to use an FQS (Fast Query Shipping) plan, in which the whole query is pushed down to the DN nodes.
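In a real system, where test_1 already holds data, changing the distribution key normally means rebuilding the table. The statements below are a sketch of that approach using only standard create/insert/rename steps; the table name test_1_new is made up for this example, and whether your GBase version also supports altering the distribution key in place is not assumed here.
-- Sketch: re-distribute an existing table by rebuilding it on the new key
-- (assumes a maintenance window in which test_1 can be dropped and replaced).
begin;
create table test_1_new(col int,id int,name varchar(64))
distribute by hash(id);
insert into test_1_new select col,id,name from test_1;
drop table test_1;
alter table test_1_new rename to test_1;
commit;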