摘要
我在 MySQL8.0上创建了10亿+张InnoDB表(注意是表而不是行),如下:
$ mysql -A
Welcome to the MySQL monitor. Commands end with ; or g.
Your MySQL connection id is 1425329
Server version: 8.0.12 MySQL Community Server - GPL
Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or 'h' for help. Type 'c' to clear the current input statement.
mysql> select count(*) from information_schema.tables;
+------------+
| count(*) |
+------------+
| 1011570298 |
+------------+
1 row in set (6 hours 57 min 6.31 sec)
是的,它耗费了6小时57分钟去统计表数目!
谁会需要创建10亿+张表?
在我之前的文章中,我创建和测试了MySQL 8.0上创建4000w张表(这是一个真实的案例)。不过10亿张表不是真实的案例场景,是因为我想挑战下在PG上创建了10亿张表的测试,所以准备在MySQL下创建下10亿张InnoDB表。
注:我认为MySQL8.0才是第一个具有创建10亿张InnoDB表可能性的MySQL版本。
挑战10亿张InnoDB表
磁盘空间
首先面临第一个也是最重要的挑战就是磁盘空间。创建.ibd文件时,InnoDB在磁盘上分配数据页。如果不做磁盘压缩,我们至少需要25T的存储容量。不过好消息是:我们的ZFS提供透明的磁盘压缩。以下是磁盘利用率的表现:
实际大小:
# du -sh --apparent-size /mysqldata/
26T /mysqldata/
压缩后:
# du -sh /mysqldata/
2.4T /mysqldata/
压缩率:
# zfs get compression,compressratio
...
mysqldata/mysql/data compressratio 7.14x -
mysqldata/mysql/data compression gzip inherited from mysqldata/mysql
(看起来报告不是100%准确,我们达到了10倍+的压缩率)
许多小文件
为每张表要创建一个表空间文件,这是大问题。不过在MySQL 8.0中,我们可以创建一个通用表空间(General Tablespace)并在创建表时将表”分配“到表空间上。这里我为每个database创建一个通用表空间,每个database上创建了1000张表。
结果就是:
mysql> select count(*) from information_schema.schema;
+----------+
| count(*) |
+----------+
| 1011575 |
+----------+
1 row in set (1.31 sec)
创建表
另一个挑战点就是如何快速的创建表从而避免我们要耗费数月的时间。我用了三个锦囊妙计:
- 禁用MySQL里面一切可能的一致性检测,减小innodb的page大小为4k(这些配置更改不适合生产环境)
- 并发创建表。因为之前MySQL 8.0中的互斥量争用问题已经得到修复,所以并发创建表表现良好。
- 在AWS ec2 i3.8 xlarge的实例上使用本地的NVMe卡
my.cnf config file (I repeat: do not use this in production):
my.cnf的配置信息如下(重申一遍:不要直接用在生产上):
[mysqld]
default-authentication-plugin = mysql_native_password
performance_schema=0
datadir=/mysqldata/mysql/data
socket=/mysqldata/mysql/data/mysql.sock
log-error = /mysqldata/mysql/log/error.log
skip-log-bin=1
innodb_log_group_home_dir = /mysqldata/mysql/log/
innodb_doublewrite = 0
innodb_checksum_algorithm=none
innodb_log_checksums=0
innodb_flush_log_at_trx_commit=0
innodb_log_file_size=2G
innodb_buffer_pool_size=100G
innodb_page_size=4k
innodb_flush_method=nosync
innodb_io_capacity_max=20000
innodb_io_capacity=5000
innodb_buffer_pool_instances=32
innodb_stats_persistent = 0
tablespace_definition_cache = 524288
schema_definition_cache = 524288
table_definition_cache = 524288
table_open_cache=524288
table_open_cache_instances=32
open-files-limit=1000000
ZFS pool:
# zpool status
pool: mysqldata
state: ONLINE
scan: scrub repaired 0B in 1h49m with 0 errors on Sun Oct 14 02:13:17 2018
config:
NAME STATE READ WRITE CKSUM
mysqldata ONLINE 0 0 0
nvme0n1 ONLINE 0 0 0
nvme1n1 ONLINE 0 0 0
nvme2n1 ONLINE 0 0 0
nvme3n1 ONLINE 0 0 0
errors: No known data errors
一个简单的并发创建表的脚本(表结构使用了sysbench里面的表):
#/bin/bash
function do_db {
db_exist=$(mysql -A -s -Nbe "select 1 from information_schema.schemata where schema_name = '$db'")
if [ "$db_exist" == "1" ]; then echo "Already exists: $db"; return 0; fi;
tbspace="create database $db; use $db; CREATE TABLESPACE $db ADD DATAFILE '$db.ibd' engine=InnoDB";
#echo "Tablespace $db.ibd created!"
tables=""
for i in {1..1000}
do
table="CREATE TABLE sbtest$i ( id int(10) unsigned NOT NULL AUTO_INCREMENT, k int(10) unsigned NOT NULL DEFAULT '0', c varchar(120) NOT NULL DEFAULT '', pad varchar(60) NOT NULL DEFAULT '', PRIMARY KEY (id), KEY k_1 (k) ) ENGINE=InnoDB DEFAULT CHARSET=latin1 tablespace $db;"
tables="$tables; $table;"
done
echo "$tbspace;$tables" | mysql
}
c=0
echo "starting..."
c=$(mysql -A -s -Nbe "select max(cast(SUBSTRING_INDEX(schema_name, '_', -1) as unsigned)) from information_schema.schemata where schema_name like 'sbtest_%'")
for m in {1..100000}
do
echo "m=$m"
for i in {1..30}
do
let c=$c+1
echo $c
db="sbtest_$c"
do_db &
done
wait
done
我们创建表有多快呢?可以通过下面的状态量观测:
# mysqladmin -i 10 -r ex|grep Com_create_table
...
| Com_create_table | 6497 |
| Com_create_table | 6449
我们约每秒创建650张表,上面是每10秒创建的表数量。
统计表数量
之前我们通过"count(*) from information_schema.tables"方式查看表数量耗费了6个多小时。因为:
-
MySQL 8.0 使用了一个新的数据字典(这很妙,避免创建10亿个frm文件)。所有的内容都存储在下面这个文件里:
ls -lah /mysqldata/mysql/data/mysql.ibd
-rw-r----- 1 mysql mysql 6.1T Oct 18 15:02 /mysqldata/mysql/data/mysql.ibd
-
information_schema.tables实际上是一个视图:
mysql> show create table information_schema.tablesG *************************** 1. row *************************** View: TABLES Create View: CREATE ALGORITHM=UNDEFINED DEFINER=
mysql.infoschema@localhostSQL SECURITY DEFINER VIEWinformation_schema.TABLESAS selectcat.nameASTABLE_CATALOG,sch.nameASTABLE_SCHEMA,tbl.nameASTABLE_NAME,tbl.typeASTABLE_TYPE,if((tbl.type= 'BASE TABLE'),tbl.engine,NULL) ASENGINE,if((tbl.type= 'VIEW'),NULL,10) ASVERSION,tbl.row_formatASROW_FORMAT,internal_table_rows(sch.name,tbl.name,if(isnull(tbl.partition_type),tbl.engine,''),tbl.se_private_id,(tbl.hidden<> 'Visible'),ts.se_private_data,coalesce(stat.table_rows,0),coalesce(cast(stat.cached_timeas unsigned),0)) ASTABLE_ROWS,internal_avg_row_length(sch.name,tbl.name,if(isnull(tbl.partition_type),tbl.engine,''),tbl.se_private_id,(tbl.hidden<> 'Visible'),ts.se_private_data,coalesce(stat.avg_row_length,0),coalesce(cast(stat.cached_timeas unsigned),0)) ASAVG_ROW_LENGTH,internal_data_length(sch.name,tbl.name,if(isnull(tbl.partition_type),tbl.engine,''),tbl.se_private_id,(tbl.hidden<> 'Visible'),ts.se_private_data,coalesce(stat.data_length,0),coalesce(cast(stat.cached_timeas unsigned),0)) ASDATA_LENGTH,internal_max_data_length(sch.name,tbl.name,if(isnull(tbl.partition_type),tbl.engine,''),tbl.se_private_id,(tbl.hidden<> 'Visible'),ts.se_private_data,coalesce(stat.max_data_length,0),coalesce(cast(stat.cached_timeas unsigned),0)) ASMAX_DATA_LENGTH,internal_index_length(sch.name,tbl.name,if(isnull(tbl.partition_type),tbl.engine,''),tbl.se_private_id,(tbl.hidden<> 'Visible'),ts.se_private_data,coalesce(stat.index_length,0),coalesce(cast(stat.cached_timeas unsigned),0)) ASINDEX_LENGTH,internal_data_free(sch.name,tbl.name,if(isnull(tbl.partition_type),tbl.engine,''),tbl.se_private_id,(tbl.hidden<> 'Visible'),ts.se_private_data,coalesce(stat.data_free,0),coalesce(cast(stat.cached_timeas unsigned),0)) ASDATA_FREE,internal_auto_increment(sch.name,tbl.name,if(isnull(tbl.partition_type),tbl.engine,''),tbl.se_private_id,(tbl.hidden<> 'Visible'),ts.se_private_data,coalesce(stat.auto_increment,0),coalesce(cast(stat.cached_timeas unsigned),0),tbl.se_private_data) ASAUTO_INCREMENT,tbl.createdASCREATE_TIME,internal_update_time(sch.name,tbl.name,if(isnull(tbl.partition_type),tbl.engine,''),tbl.se_private_id,(tbl.hidden<> 'Visible'),ts.se_private_data,coalesce(cast(stat.update_timeas unsigned),0),coalesce(cast(stat.cached_timeas unsigned),0)) ASUPDATE_TIME,internal_check_time(sch.name,tbl.name,if(isnull(tbl.partition_type),tbl.engine,''),tbl.se_private_id,(tbl.hidden<> 'Visible'),ts.se_private_data,coalesce(cast(stat.check_timeas unsigned),0),coalesce(cast(stat.cached_timeas unsigned),0)) ASCHECK_TIME,col.nameASTABLE_COLLATION,internal_checksum(sch.name,tbl.name,if(isnull(tbl.partition_type),tbl.engine,''),tbl.se_private_id,(tbl.hidden<> 'Visible'),ts.se_private_data,coalesce(stat.checksum,0),coalesce(cast(stat.cached_timeas unsigned),0)) ASCHECKSUM,if((tbl.type= 'VIEW'),NULL,get_dd_create_options(tbl.options,if((ifnull(tbl.partition_expression,'NOT_PART_TBL') = 'NOT_PART_TBL'),0,1))) ASCREATE_OPTIONS,internal_get_comment_or_error(sch.name,tbl.name,tbl.type,tbl.options,tbl.comment) ASTABLE_COMMENTfrom (((((mysql.tablestbljoinmysql.schemataschon((tbl.schema_id=sch.id))) joinmysql.catalogscaton((cat.id=sch.catalog_id))) left joinmysql.collationscolon((tbl.collation_id=col.id))) left joinmysql.tablespacestson((tbl.tablespace_id=ts.id))) left joinmysql.table_statsstaton(((tbl.name=stat.table_name) and (sch.name=stat.schema_name)))) where (can_access_table(sch.name,tbl.name) and is_visible_dd_object(tbl.hidden)) character_set_client: utf8 collation_connection: utf8_general_ci
而且通过explain看到它的执行计划如下:
mysql> explain select count(*) from information_schema.tables G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: cat
partitions: NULL
type: index
possible_keys: PRIMARY
key: name
key_len: 194
ref: NULL
rows: 1
filtered: 100.00
Extra: Using index
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: tbl
partitions: NULL
type: ALL
possible_keys: schema_id
key: NULL
key_len: NULL
ref: NULL
rows: 1023387060
filtered: 100.00
Extra: Using where; Using join buffer (Block Nested Loop)
*************************** 3. row ***************************
id: 1
select_type: SIMPLE
table: sch
partitions: NULL
type: eq_ref
possible_keys: PRIMARY,catalog_id
key: PRIMARY
key_len: 8
ref: mysql.tbl.schema_id
rows: 1
filtered: 11.11
Extra: Using where
*************************** 4. row ***************************
id: 1
select_type: SIMPLE
table: stat
partitions: NULL
type: eq_ref
possible_keys: PRIMARY
key: PRIMARY
key_len: 388
ref: mysql.sch.name,mysql.tbl.name
rows: 1
filtered: 100.00
Extra: Using index
*************************** 5. row ***************************
id: 1
select_type: SIMPLE
table: ts
partitions: NULL
type: eq_ref
possible_keys: PRIMARY
key: PRIMARY
key_len: 8
ref: mysql.tbl.tablespace_id
rows: 1
filtered: 100.00
Extra: Using index
*************************** 6. row ***************************
id: 1
select_type: SIMPLE
table: col
partitions: NULL
type: eq_ref
possible_keys: PRIMARY
key: PRIMARY
key_len: 8
ref: mysql.tbl.collation_id
rows: 1
filtered: 100.00
Extra: Using index
结论
只是因为个人兴趣,我在MySQL 8.0上创建了10亿张InnoDB表和索引,我成功了。它花费了我大约2周的时间。
大概率MySQL 8.0是MySQL里面第一个支持能够创建10亿张InnoDB表的版本。
ZFS 的压缩再结合NVMe卡,可以降低成本。例如,选择AWS的i3.4xlarge或者i3.8xlarge实例。