MySQL对group by的优化

315 阅读7分钟

group by操作

group by操作在没有合适的索引可用时,通常先扫描整个表提取数据并创建一个临时表,然后按照group by指定的列进行排序;在这个临时表里面,对于每一个group 分组的数据行来说是连续在一起的。

完成排序之后,就可以得到所有的groups 分组,并可以执行聚集函数(aggregate function)。

可以看到,在没有使用索引的时候,需要创建临时表和排序;那在执行计划的 Extra 额外信息中通常就会看到这些信息 Using temporary; Using filesort 出现 。

验证:

  1. Using index 表示直接访问索引就能够获取到所需要的数据(覆盖索引),不需要通过索引回表;
  2. Using where 使用了用where子句来过滤结果集。 表示MySQL服务器在存储引擎收到记录后进行“后过滤”(Post-filter),如果查询未能使用索引,Using where的作用只是提醒我们MySQL将用where子句来过滤结果集。这个一般发生在MySQL服务器,而不是存储引擎层。一般发生在不能走索引扫描的情况下或者走索引扫描,但是有些查询条件不在索引当中的情况下。Using where只是过滤元组,和是否读取数据文件或索引文件没有关系
  3. Using filesort MySQL中无法利用索引完成的排序操作称为“文件排序”,使用文件排序,使用非索引列进行排序时出现,非常消耗性能,尽量优化。
  4. Using temporary 使用了临时表,表示MySQL需要使用临时表来存储结果集,常见于排序和分组查询
  5. Using Index Condition: 在MySQL 5.6版本后加入的新特性(Index Condition Pushdown);会先条件过滤索引,过滤完索引后找到所有符合索引条件的数据行,随后用 WHERE 子句中的其他条件去过滤这些数据行;

Mysql 对group by操作的优化

dev.mysql.com/doc/refman/…

对于GROUP BY 子句,通用方法是全表扫描并创建一个新的临时表,其中每个组的所有行都是连续的,然后使用此临时表进行聚合计算。在某些情况下,MySQL 通过使用索引访问来避免创建临时表。

group by使用索引的前置条件,是所有的group by中的列,必须在同一个索引里面,但这只是前提条件,最终是不是使用索引还会受where条件和聚合函数的影响

MySQL会使用两种方式来做group by的优化:

  • 松索引扫描(Loose Index Scan)
  • 紧索引扫描(Tight Index Scan)

这两种方式的区别是,松索引扫描是在扫描的时候就进行聚合计算。而紧索引扫描会先进行一次范围扫描,然后聚合计算结果。

Loose Index Scan

当索引本身就能够覆盖到所有group的列的时候,MySQL就会使用松索引扫描。为什么叫松索引扫描呢?因为这种扫描方式只考虑索引中的部分key,而不是全部key。

如果使用了松索引扫描,你的查询语句explain会在Extra中显示:Using index for group-by (scanning)

松散扫描有以下限制:

  1. 查询必须是单表,跨表不行
  2. group by的列必须符合索引的最左匹配原则
  3. select中,只有MIN和MAX聚合函数能用,里面的列必须在索引中,并且必须紧邻在group by列的后面
  4. 除了MIN和MAX函数的参数外,查询中的列如果不在group by中,那where条件中这一列只能是“等于一个常量”,而如果在group by的列中,可以使用大于和小于等条件。
  5. 字符串的前缀索引不适用,比如c1 varchar(20),索引c1(10)就不行。

我们假设有一个表t1,有c1,c2,c3,c4列。有一个索引idx(c1,c2,c3)。那下面这些查询就可以使用松索引扫描:

SELECT c1, c2 FROM t1 GROUP BY c1, c2;
SELECT DISTINCT c1, c2 FROM t1;
SELECT c1, MIN(c2) FROM t1 GROUP BY c1;
SELECT c1, c2 FROM t1 WHERE c1 < const GROUP BY c1, c2;
SELECT MAX(c3), MIN(c3), c1, c2 FROM t1 WHERE c2 > const GROUP BY c1, c2;
SELECT c2 FROM t1 WHERE c1 < const GROUP BY c1, c2;
SELECT c1, c2 FROM t1 WHERE c3 = const GROUP BY c1, c2;

但下面这几种就不行:

# 限制3,只有MIN和MAX能用
SELECT c1, SUM(c2) FROM t1 GROUP BY c1;
# 限制2,gropu by里面不符合最左匹配
SELECT c1, c2 FROM t1 GROUP BY c2, c3;
# 限制4,c3不在group by的列中,且没有等于一个常量
SELECT c1, c3 FROM t1 GROUP BY c1, c2;

除了group by以外,松索引扫描还用于这几种情况的聚合函数:

  • AVG DISTINCT, SUM DISTINCT, COUNT DISTINCT
  • 在查询中没有GROUP BY 和 DISTINCT
  • 仍然有上面说的那些限制

比如下面两个sql是可以用松索引扫描的:

SELECT COUNT(DISTINCT c1), SUM(DISTINCT c1) FROM t1;

SELECT COUNT(DISTINCT c1, c2), COUNT(DISTINCT c2, c1) FROM t1;

Tight Index Scan

紧索引扫描会根据查询条件来决定是全索引扫描还是范围索引扫描。当条件不满足松散索引扫描时,尤其是不满足索引的最左匹配原则时,如果查询条件里面,有一些条件能够“填充”索引中的"缝隙",MySQL也能够利用索引排好序的能力,直接完成group。

同样是上面的表t1,下面这两种sql虽然不能使用松索引扫描,但可以使用紧索引扫描,避免全表扫描和创建临时表。

  1. 虽然group by c1,c3不符合索引的最左匹配规则,但因为c2等于了一个常量,相当于填充了“缝隙”,所以也有效。
SELECT c1, c2, c3 FROM t1 WHERE c2 = 'a' GROUP BY c1, c3;
  1. 虽然group by c2, c3没有从c1开始,同样不符合索引的最左匹配规则,但c1等于了一个常量,也填充了“缝隙”。
SELECT c1, c2, c3 FROM t1 WHERE c1 = 'a' GROUP BY c2, c3;

回到工作

松索引扫描的性能是最高的,核心原理在于不扫描整棵索引树,只扫描必要的部分。其实除了group by以外,在正常的查询中,MySQL也可能使用松索引扫描来提升性能,是否使用松索引扫描,MySQL会根据数据的分布情况来测算,优化器来做最后的决定。

再回到我自己遇到的问题。我大概是有一个表,sql是这样:

explain select month_type_sort,
       departure_province_area,
       arrival_province_area
      , sum(total_calcWeight),
       sum(actual_capacity) as ratio,
       sum(un_filling_calc_weight)
from t_load_ratio_comp_temp
where month_short = '2022-06'
group by departure_province_area, arrival_province_area, month_type_sort;

把sum()注释掉:

create index idx_province_province_month
    on t_load_ratio_comp_temp (departure_province_area, arrival_province_area, month_type_sort, month_short,
                               total_calcWeight, actual_capacity, un_filling_calc_weight, week_type,
                               transport_line_type);
DROP INDEX idx_province_province_month ON t_load_ratio_comp_temp;

explain select month_type_sort,
       departure_province_area,
       arrival_province_area
#       , sum(total_calcWeight),
#        sum(actual_capacity) as ratio,
#        sum(un_filling_calc_weight)
from t_load_ratio_comp_temp
where month_short = '2022-06'
group by departure_province_area, arrival_province_area, month_type_sort;
idselect_typetablepartitionstypepossible_keyskeykey_lenrefrowsfilteredExtra
1SIMPLEt_load_ratio_comp_tempNULLrangeidx_province_province_monthidx_province_province_month372NULL1566100Using where; Using index for group-by

使用了松索引扫描,explain中在Extra显示:Using index for group-by (scanning)

不注释sum()

idselect_typetablepartitionstypepossible_keyskeykey_lenrefrowsfilteredExtra
1SIMPLEt_load_ratio_comp_tempNULLindexidx_province_province_monthidx_province_province_month1059NULL39342710Using where; Using index
  • 由于松索引扫描不支持SUM函数,所以这个SQL应该是走不了松索引扫描的。
  • 通过explain结果,发现确实走了索引,这里就涉及到另外一个知识点。联合索引和group by 之间的优化

联合索引和group by 之间的优化

create table t1(
id int primary key not null auto_increment,
c1 varchar(10),
c2 varchar(10),
c3 varchar(10),
c4 varchar(10),
c5 varchar(10));

insert into t1(c1,c2,c3,c4,c5) values ('a1','a2','a3','a4','a5');
insert into t1(c1,c2,c3,c4,c5) values ('b1','b2','b3','b4','b5');
insert into t1(c1,c2,c3,c4,c5) values ('c1','c2','c3','c4','c5');
insert into t1(c1,c2,c3,c4,c5) values ('d1','d2','d3','d4','d5');
insert into t1(c1,c2,c3,c4,c5) values ('e1','e2','e3','e4','e5');

create index idx_t1_c1234 on t1(c1,c2,c3,c4);

查看用1-4个索引的key_len情况

EXPLAIN SELECT * FROM t1 WHERE c1 = 'a1';
EXPLAIN SELECT * FROM t1 WHERE c1 = 'a1' AND c2 = 'a2';
EXPLAIN SELECT * FROM t1 WHERE c1 = 'a1' AND c2 = 'a2' AND c3 = 'a3';
EXPLAIN SELECT * FROM t1 WHERE c1 = 'a1' AND c2 = 'a2' AND c3 = 'a3' AND c4 = 'a4';

结果分别为:43,86,129,172

  • 需要注意OEDER/GROUP BY如果使用了索引并不会显示在key_len中(覆盖索引除外),
  • 对于OEDER BY的优化主要看Extra字段有无出现Using filesort,如果可以利用索引排序,则不会出现filesort。
  • 对于GROUP BY的优化主要看Extra字段有无出现Using temporary,如果可以利用索引分组,则不会出现temporary

EXPLAIN SELECT * FROM t1 WHERE c1 = 'a1' AND c2 = 'a2' ORDER BY c1;
EXPLAIN SELECT * FROM t1 WHERE c1 = 'a1' AND c2 = 'a2' ORDER BY c2;
EXPLAIN SELECT * FROM t1 WHERE c1 = 'a1' AND c2 = 'a2' ORDER BY c3;

上面3条语句explain的数据结果差不多,以第三条为例

idselect_typetablepartitionstypepossible_keyskeykey_lenrefrowsfilteredExtra
1SIMPLEt1NULLrefidx_t1_c1234idx_t1_c123486const,const1100NULL

均没有出现filersort。说明如果符合最左匹配原则,就可以使用索引排序

EXPLAIN SELECT * FROM t1 WHERE c1 = 'a1' AND c2 = 'a2' ORDER BY c4;
idselect_typetablepartitionstypepossible_keyskeykey_lenrefrowsfilteredExtra
1SIMPLEt1NULLrefidx_t1_c1234idx_t1_c123486const,const1100Using filesort

根据key_len字段可以看出只使用了c1,c2索引。

EXPLAIN SELECT c1,c2,c3,c4 FROM t1 WHERE c1 = 'a1' AND c2 = 'a2' ORDER BY c1;
EXPLAIN SELECT c1,c2,c3,c4 FROM t1 WHERE c1 = 'a1' AND c2 = 'a2' ORDER BY c2;
EXPLAIN SELECT c1,c2,c3,c4 FROM t1 WHERE c1 = 'a1' AND c2 = 'a2' ORDER BY c3;
idselect_typetablepartitionstypepossible_keyskeykey_lenrefrowsfilteredExtra
1SIMPLEt1NULLrefidx_t1_c1234idx_t1_c123486const,const1100Using index

Using index: 只走索引拿到了所有的数据,没有回表

EXPLAIN SELECT c1,c2,c3,c4 FROM t1 WHERE c1 = 'a1' AND c2 = 'a2' ORDER BY c4;
idselect_typetablepartitionstypepossible_keyskeykey_lenrefrowsfilteredExtra
1SIMPLEt1NULLrefidx_t1_c1234idx_t1_c123486const,const1100Using index; Using filesort

EXPLAIN SELECT * FROM t1 WHERE c1 = 'a1' ORDER BY c2, c3;
idselect_typetablepartitionstypepossible_keyskeykey_lenrefrowsfilteredExtra
1SIMPLEt1NULLrefidx_t1_c1234idx_t1_c123443const1100NULL
EXPLAIN SELECT * FROM t1 WHERE c1 = 'a1' AND c2 = 'a2' ORDER BY c2, c3;
idselect_typetablepartitionstypepossible_keyskeykey_lenrefrowsfilteredExtra
1SIMPLEt1NULLrefidx_t1_c1234idx_t1_c123486const,const1100NULL
EXPLAIN SELECT * FROM t1 WHERE c1 = 'a1' AND c2 = 'a2' ORDER BY c3, c2;
idselect_typetablepartitionstypepossible_keyskeykey_lenrefrowsfilteredExtra
1SIMPLEt1NULLrefidx_t1_c1234idx_t1_c123486const,const1100NULL
EXPLAIN SELECT * FROM t1 WHERE c1 = 'a1' ORDER BY c3, c2;
idselect_typetablepartitionstypepossible_keyskeykey_lenrefrowsfilteredExtra
1SIMPLEt1NULLrefidx_t1_c1234idx_t1_c123443const1100Using filesort

EXPLAIN SELECT * FROM t1 ORDER BY c1;
EXPLAIN SELECT * FROM t1 WHERE c2 = 'a2' ORDER BY c1;

这两条均出现filesort,说明如果没有WHERE语句或者有WHERE但不符合最左匹配,就会有filesort


EXPLAIN SELECT * FROM t1 WHERE c1 = 'a1' ORDER BY c2, c3;
EXPLAIN SELECT * FROM t1 WHERE c1 = 'a1' ORDER BY c2, c3 DESC;

上面无filesort,下面有filesort。说明降序OEDER BY也会导致索引失效


EXPLAIN SELECT * FROM t1 WHERE c1 = 'a1' ORDER BY c2;
idselect_typetablepartitionstypepossible_keyskeykey_lenrefrowsfilteredExtra
1SIMPLEt1NULLrefidx_t1_c1234idx_t1_c123443const1100NULL
EXPLAIN SELECT * FROM t1 WHERE c1 = 'a1' ORDER BY c2 DESC;
idselect_typetablepartitionstypepossible_keyskeykey_lenrefrowsfilteredExtra
1SIMPLEt1NULLrefidx_t1_c1234idx_t1_c123443const1100Backward index scan

两条均无filesort,但是下面有Backward index scan,说明如果仅对一个索引项进行倒序排序,那么也可以避免filesort,使用反向索引扫描

若把 SELECT * FROM 更改成 SELECT c1,c2,c3,c4 FROM呢?

EXPLAIN SELECT c1,c2,c3,c4 FROM t1 WHERE c2 = 'a2' ORDER BY c4;
idselect_typetablepartitionstypepossible_keyskeykey_lenrefrowsfilteredExtra
1SIMPLEt1NULLindexidx_t1_c1234idx_t1_c1234172NULL520Using where; Using index; Using filesort
EXPLAIN SELECT c1,c2,c3,c4 FROM t1 WHERE c2 = 'a2' ORDER BY c1;
idselect_typetablepartitionstypepossible_keyskeykey_lenrefrowsfilteredExtra
1SIMPLEt1NULLindexidx_t1_c1234idx_t1_c1234172NULL520Using where; Using index

EXPLAIN SELECT c1,c2,c3,c4 FROM t1 WHERE c2 = 'a2' ORDER BY c1;
idselect_typetablepartitionstypepossible_keyskeykey_lenrefrowsfilteredExtra
1SIMPLEt1NULLindexidx_t1_c1234idx_t1_c1234172NULL520Using where; Using index
EXPLAIN SELECT * FROM t1 WHERE c2 = 'a2' ORDER BY c1;
idselect_typetablepartitionstypepossible_keyskeykey_lenrefrowsfilteredExtra
1SIMPLEt1NULLALLNULLNULLNULLNULL520Using where; Using filesort

EXPLAIN SELECT c1,c2,c3,c4  FROM t1 WHERE c2 = 'a2' ORDER BY c1;
idselect_typetablepartitionstypepossible_keyskeykey_lenrefrowsfilteredExtra
1SIMPLEt1NULLindexidx_t1_c1234idx_t1_c1234172NULL520Using where; Using index
EXPLAIN SELECT c1,c2,c3,c4  FROM t1 WHERE  c2 = 'a2' ORDER BY c2;
idselect_typetablepartitionstypepossible_keyskeykey_lenrefrowsfilteredExtra
1suSIMPLEt1NULLindexidx_t1_c1234idx_t1_c1234172NULL520Using where; Using index