GBase8c sql性能优化之分片键 案例一

104 阅读3分钟

原文链接:www.gbase.cn/community/p…
更多精彩内容尽在南大通用GBase技术社区,南大通用致力于成为用户最信赖的数据库产品供应商。

本案例用来示例如何通过修改分片键来降低资源消耗,提高sql性能。

示例sql:

create table test(col int,id int,name text)
distribute by hash (col);
create table test_1(col int,id int,name varchar(64))
distribute by hash(name);
insert into test select 1,generate_series(1,100000), md5(random()::text);
insert into test select 64,generate_series(1,100000), md5(random()::text);

insert into test_1 select generate_series(1,100000),generate_series(1,100000), md5(random()::text);

执行计划如下:

explain analyze select * from test a join test_1 b on a.col=b.id ; 
                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Streaming(type: GATHER)  (cost=13.29..29.25 rows=10 width=194) (actual time=106.033..1529.207 rows=200000 loops=1)
   Spawn on: All datanodes
   ->  Hash Join  (cost=13.29..28.64 rows=20 width=194) (Actual time: never executed)
         Hash Cond: (a.col = b.id)
         ->  Streaming(type: BROADCAST)  (cost=0.00..15.18 rows=40 width=40) (Actual time: never executed)
               Spawn on: All datanodes
               ->  Seq Scan on test a  (cost=0.00..13.13 rows=20 width=40) (Actual time: never executed)
         ->  Hash  (cost=13.13..13.13 rows=21 width=154) (Actual time: never executed)
                Buckets: 0  Batches: 0  Memory Usage: 0kB
               ->  Seq Scan on test_1 b  (cost=0.00..13.13 rows=20 width=154) (Actual time: never executed)
 Total runtime: 1562.160 ms

由于 test 表的分片键为col 字段,test_1表的分片键为name,而关联条件为 a.col=b.id。此时执行计划走的是stream,需要将表test 广播到 all datanodes(所有dn节点),即所有的dn节点都有一份表test,然后每个dn节点根据关联a.col=b.id条件 进行关联。所有dn节点完成关联后,返回给上层cn,即 streaming(Gather)。能看到本次执行,数据在dn节点是有交互、互相拉取,产生了额外的网络开销(此点可进行优化)。

优化点:消除dn节点互相拉取数据,网络交互带来的开销。即根据关联条件,将test_1表的分片键设置成 id 字段。

create table test(col int,id int,name text)
distribute by hash (col);
create table test_1(col int,id int,name varchar(64))
distribute by hash(id);
insert into test select 1,generate_series(1,100000), md5(random()::text);
insert into test select 64,generate_series(1,100000), md5(random()::text);

insert into test_1 select generate_series(1,100000),generate_series(1,100000), md5(random()::text);

执行计划如下:

postgres=# explain analyze select * from test a join test_1 b on a.col=b.id; 
                                                      QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
Data Node Scan  (cost=0.00..0.00 rows=1000 width=194) (actual time=36.912..894.167 rows=200000 loops=1)
  Node/s: All datanodes
  Remote query: SELECT a.col, a.id, a.name, b.col, b.id, b.name FROM public.test a JOIN public.test_1 b ON a.col = b.id
  ->  Hash Join  (cost=805.59..238675.59 rows=20775000 width=195)
        Hash Cond: (a.col = b.id)
        ->  Seq Scan on test a  (cost=0.00..3870.00 rows=200000 width=41)
        ->  Hash  (cost=675.75..675.75 rows=20775 width=154)
              ->  Seq Scan on test_1 b  (cost=0.00..675.75 rows=20775 width=154)
Total runtime: 955.638 ms
(9 rows)

能看到此时执行时间由原来的 1562.160 ms 提升到 955.638 ms。从执行计划看到,消除dn节点互相拉取数据,网络交互带来的开销,每个dn节点都只在本dn节点做关联查询,查询完成后返回给上层。

总结:此案例,利用关联条件,调整表的分片键,消除dn节点互相拉取数据,网络交互带来的开销,使其走fsq执行计划。

原文链接:www.gbase.cn/community/p…
更多精彩内容尽在南大通用GBase技术社区,南大通用致力于成为用户最信赖的数据库产品供应商。