Thoughts on the MySQL LIMIT Problem


Usage and purpose of LIMIT

The LIMIT clause in MySQL is mainly used for pagination: for example, to display rows 100 through 120 of a result set. The syntax is LIMIT M, N, where M is the offset (which row to start from) and N is the number of rows to return.
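For example, both of the following queries skip the first 100 rows and return the next 20 against the member table used below; the two forms are equivalent in MySQL:

```sql
-- Rows 101-120 of the ordered result set.
SELECT id, name FROM member ORDER BY id LIMIT 100, 20;
-- Same query in the alternative OFFSET syntax.
SELECT id, name FROM member ORDER BY id LIMIT 20 OFFSET 100;
```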

Problems caused by LIMIT

When the data volume is small or moderate, LIMIT solves pagination conveniently. But on a large table, and especially when the offset M is large, the performance of the LIMIT clause degrades noticeably. Consider the following example:

CREATE TABLE `member` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(10) NOT NULL COMMENT '姓名',
  `gender` tinyint(3) unsigned NOT NULL COMMENT '性别',
  PRIMARY KEY (`id`),
  KEY `gender` (`gender`)
);
mh01@3306>desc member;
+--------+---------------------+------+-----+---------+----------------+
| Field  | Type                | Null | Key | Default | Extra          |
+--------+---------------------+------+-----+---------+----------------+
| id     | int(10) unsigned    | NO   | PRI | NULL    | auto_increment |
| name   | varchar(10)         | NO   |     | NULL    |                |
| gender | tinyint(3) unsigned | NO   | MUL | NULL    |                |
+--------+---------------------+------+-----+---------+----------------+
3 rows in set (0.00 sec)
mh01@3306>select count(*) from member;
+----------+
| count(*) |
+----------+
|  2000000 |
+----------+
1 row in set (0.27 sec)


The table above holds 2,000,000 rows. We now vary the LIMIT offset to show its effect on performance.

mh01@3306>select * from member where gender=1 limit 10,2;
+----+------------+--------+
| id | name       | gender |
+----+------------+--------+
| 21 | d2fe4476b8 |      1 |
| 24 | afd9592b48 |      1 |
+----+------------+--------+
2 rows in set (0.00 sec)

mh01@3306>select * from member where gender=1 limit 200000,2;
+--------+------------+--------+
| id     | name       | gender |
+--------+------------+--------+
| 399604 | 673869a827 |      1 |
| 399605 | fe46b99c08 |      1 |
+--------+------------+--------+
2 rows in set (0.18 sec)

mh01@3306>select * from member where gender=1 limit 800000,2; 
+---------+------------+--------+
| id      | name       | gender |
+---------+------------+--------+
| 1599363 | e1bde81632 |      1 |
| 1599364 | c6f694dc1e |      1 |
+---------+------------+--------+
2 rows in set (1.14 sec)

The queries above all fetch just 2 rows, yet performance gets steadily worse as the offset grows. Why does this happen? The answer lies in how LIMIT is executed:

  • First, MySQL scans the gender index for entries with gender=1, obtaining their primary key values.
  • For each primary key found, it goes back to the clustered index to fetch the remaining columns, because the gender index contains only the gender value and the primary key; all other columns require a lookup back to the table.
  • Only then does it discard the rows before the offset. In other words, although we only return 2 rows, LIMIT 800000,2 forces MySQL to process 800,002 rows, generating a large number of random reads along the way.
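The access path behind these steps can be confirmed with EXPLAIN. The annotated output below is a sketch; exact values depend on the MySQL version and table statistics:

```sql
EXPLAIN SELECT * FROM member WHERE gender = 1 LIMIT 800000, 2;
-- type: ref, key: gender  -> the gender secondary index drives the scan,
-- and each matching entry is then looked up in the clustered (PRIMARY)
-- index to fetch name, before rows prior to the offset are thrown away.
```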

Verification

This can be verified from two angles: the information_schema.INNODB_BUFFER_PAGE view shows which pages are cached in the buffer pool, and sys.io_global_by_file_by_bytes shows how much the corresponding .ibd file has been read.

Initial state right after restarting the database:
mh01@3306>select index_name,count(*) from information_schema.INNODB_BUFFER_PAGE where INDEX_NAME in('primary','gender') and TABLE_NAME like '%member%' group by index_name;
+------------+----------+
| index_name | count(*) |
+------------+----------+
| PRIMARY    |      399 |
+------------+----------+

mh01@3306>select * from test.member where gender=1 limit 800000,2;
+---------+------------+--------+
| id      | name       | gender |
+---------+------------+--------+
| 1599363 | e1bde81632 |      1 |
| 1599364 | c6f694dc1e |      1 |
+---------+------------+--------+
2 rows in set (2.02 sec)

After running the LIMIT 800000,2 query, the number of PRIMARY pages in the buffer pool increases sharply:
mh01@3306>select index_name,count(*) from information_schema.INNODB_BUFFER_PAGE where INDEX_NAME in('primary','gender') and TABLE_NAME like '%member%' group by index_name;
+------------+----------+
| index_name | count(*) |
+------------+----------+
| gender     |      577 |
| PRIMARY    |     3687 |
+------------+----------+
2 rows in set (0.23 sec)

io_global_by_file_by_bytes also shows that the cost is substantial; the first query below was run before the LIMIT query, the second after:
mh01@3306>select count_read from sys.io_global_by_file_by_bytes where file like '%member.ibd%';
+------------+
| count_read |
+------------+
|       1351 |
+------------+
1 row in set (0.19 sec)

mh01@3306>select count_read from sys.io_global_by_file_by_bytes where file like '%member.ibd%';
+------------+
| count_read |
+------------+
|       4266 |
+------------+
1 row in set (0.20 sec)

Solution

Now that the cause of the problem is clear, the fix follows naturally: use a covering index to locate the target primary keys first. This effectively tells the database the primary key ids at row 800,000 directly, so it no longer has to count off the offset one full row at a time.

Using a join:
mh01@3306>select a.* from member as a inner join (select id from member where gender=1 limit 800000,2) as b on a.id=b.id; 
+---------+------------+--------+
| id      | name       | gender |
+---------+------------+--------+
| 1599363 | e1bde81632 |      1 |
| 1599364 | c6f694dc1e |      1 |
+---------+------------+--------+
2 rows in set (0.15 sec)

The ordinary way:
mh01@3306>select * from test.member where gender=1 limit 800000,2;
+---------+------------+--------+
| id      | name       | gender |
+---------+------------+--------+
| 1599363 | e1bde81632 |      1 |
| 1599364 | c6f694dc1e |      1 |
+---------+------------+--------+
2 rows in set (0.93 sec)

The join rewrite greatly reduces the cost. The key is the subquery select id from member where gender=1 limit 800000,2, which fetches only the primary keys of the needed rows. The process becomes: first obtain the ids via the gender index; since the gender index already stores the corresponding id values, no table lookups are needed and no flood of random I/O is generated, so even scanning 800,000 index entries stays fast. This step yields just 2 ids, and only those 2 ids are then looked up back in the table, which improves performance dramatically.

Comparing the cost:

mh01@3306>select index_name,count(*) from information_schema.INNODB_BUFFER_PAGE where INDEX_NAME in('primary','gender') and TABLE_NAME like '%member%' group by index_name;
+------------+----------+
| index_name | count(*) |
+------------+----------+
| gender     |      246 |
| PRIMARY    |      208 |
+------------+----------+
2 rows in set (0.40 sec)

mh01@3306>select a.* from member as a inner join (select id from member where gender=1 limit 800000,2) as b on a.id=b.id;
+---------+------------+--------+
| id      | name       | gender |
+---------+------------+--------+
| 1599363 | e1bde81632 |      1 |
| 1599364 | c6f694dc1e |      1 |
+---------+------------+--------+
2 rows in set (0.22 sec)

mh01@3306>
mh01@3306>select index_name,count(*) from information_schema.INNODB_BUFFER_PAGE where INDEX_NAME in('primary','gender') and TABLE_NAME like '%member%' group by index_name;
+------------+----------+
| index_name | count(*) |
+------------+----------+
| gender     |      577 |
| PRIMARY    |      212 |
+------------+----------+
2 rows in set (0.23 sec)

The numbers show that the join version of the same query works mostly against the gender index and touches PRIMARY far less.
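Besides the deferred-join rewrite above, when pages are visited in order there is a further option not covered by the original benchmark: keyset ("seek") pagination, which avoids the offset entirely by remembering the last id of the previous page. A sketch, using the last id returned above as an illustrative starting point:

```sql
-- Fetch the next page after the last id seen on the previous page
-- (1599364 here); no rows before that point are scanned and discarded.
SELECT * FROM member
WHERE gender = 1 AND id > 1599364
ORDER BY id
LIMIT 2;
```

The trade-off is that the client must carry the last id forward, and jumping directly to an arbitrary page number is no longer possible.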
