Background
With some time on my hands, I noticed that etcdctl provides a check perf command for testing etcd performance.
# etcdctl check perf
60 / 60 Booooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00% 1m0s
PASS: Throughput is 150 writes/s
PASS: Slowest request took 0.208155s
PASS: Stddev is 0.008008s
PASS
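The check passed, but the write QPS looked low, so I read the docs and found that --load can simulate different workloads (Accepted workloads: s(small), m(medium), l(large), xl(xLarge)). Naturally I had to try the top-end xl, and that is when things went wrong.
Halfway through the benchmark, the client started flooding the terminal with mvcc: database space exceeded errors: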
# etcdctl check perf --load="xl"
...
{"level":"warn","ts":"2023-08-23T12:27:44.410Z","logger":"etcd-client","caller":"v3@v3.5.6/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00043ae00/10.1.0.201:2383","attempt":0,"error":"rpc error: code = ResourceExhausted desc = etcdserver: mvcc: database space exceeded"}
60 / 60 Boooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00% 1m0s
FAIL: too many errors
FAIL: ERROR(etcdserver: mvcc: database space exceeded) -> 158018
PASS: Throughput is 14121 writes/s
PASS: Slowest request took 0.073069s
PASS: Stddev is 0.006932s
FAIL
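Kubernetes requests against the cluster started failing as well, with Error from server: etcdserver: mvcc: database space exceeded. Just like that, I had broken the cluster 😭
Cause of the error
Following the error message, I checked the cluster status. Sure enough, the DB size had reached etcd's default 2 GB limit, and every member had raised a NOSPACE alarm: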
# etcdctl endpoint --cluster status -w=table
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
| https://10.1.0.202:2383 | 5f6505adceeb9f3a | 3.5.6 | 2.1 GB | true | false | 9 | 423725 | 423725 | memberID:14378557705461272149 |
| | | | | | | | | | alarm:NOSPACE , |
| | | | | | | | | | memberID:18129117837516069716 |
| | | | | | | | | | alarm:NOSPACE , |
| | | | | | | | | | memberID:6873906650309959482 |
| | | | | | | | | | alarm:NOSPACE |
| https://10.1.0.203:2383 | c78ae60d60798e55 | 3.5.6 | 2.1 GB | false | false | 9 | 423725 | 423725 | memberID:6873906650309959482 |
| | | | | | | | | | alarm:NOSPACE , |
| | | | | | | | | | memberID:14378557705461272149 |
| | | | | | | | | | alarm:NOSPACE , |
| | | | | | | | | | memberID:18129117837516069716 |
| | | | | | | | | | alarm:NOSPACE |
| https://10.1.0.201:2383 | fb97909efc57db54 | 3.5.6 | 2.1 GB | false | false | 9 | 423725 | 423725 | memberID:14378557705461272149 |
| | | | | | | | | | alarm:NOSPACE , |
| | | | | | | | | | memberID:18129117837516069716 |
| | | | | | | | | | alarm:NOSPACE , |
| | | | | | | | | | memberID:6873906650309959482 |
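| | | | | | | | | | alarm:NOSPACE |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+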
# etcdctl alarm list
memberID:6873906650309959482 alarm:NOSPACE
memberID:14378557705461272149 alarm:NOSPACE
memberID:18129117837516069716 alarm:NOSPACE
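To catch this before the alarm fires, the DB size can be compared against the backend quota. The helpers below are a minimal sketch, not part of etcdctl: the function names are made up for illustration, and QUOTA_BYTES assumes the default quota of 2 GiB (2147483648 bytes), which applies when etcd is started without --quota-backend-bytes.

```shell
#!/usr/bin/env bash
# Assumption: the cluster runs with etcd's default backend quota (2 GiB).
# Override QUOTA_BYTES if etcd was started with --quota-backend-bytes.
QUOTA_BYTES="${QUOTA_BYTES:-2147483648}"

# extract_db_sizes (hypothetical helper): reads the JSON produced by
# `etcdctl endpoint status -w=json` on stdin and prints one dbSize per line.
extract_db_sizes() {
  grep -oE '"dbSize":[[:space:]]*[0-9]+' | grep -oE '[0-9]+$'
}

# db_usage_percent (hypothetical helper): prints how much of the quota a
# given DB size in bytes occupies, as an integer percentage (rounded down).
db_usage_percent() {
  local db_size="$1"
  echo $(( db_size * 100 / QUOTA_BYTES ))
}
```

A possible way to use it, assuming etcdctl is already configured for the cluster:

etcdctl endpoint status --cluster -w=json | extract_db_sizes | while read -r size; do echo "$(db_usage_percent "$size")% of quota used"; done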
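Fix
Why did the DB fill up?
etcd's MVCC store keeps multiple historical versions of every key. The benchmark deletes its test keys when it finishes, but etcd still retains the history of those keys, so the space stays occupied until that history is compacted.
On top of that, the disks behind my etcd are fast, so a single benchmark run was enough to fill the quota.
In this situation, you can manually clean up the historical data first to bring etcd back to a normal size.
Cleanup steps
1. Get the current revision
# etcdctl endpoint status -w=json | jq '.[0].Status.header.revision'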
3248964
2. Compact the history up to that revision
# etcdctl compaction 3248964
compacted revision 3248964
3. Defragment; the space is only released after defragmentation
# etcdctl defrag --cluster
Finished defragmenting etcd member[https://10.1.0.202:2383]
Finished defragmenting etcd member[https://10.1.0.203:2383]
Finished defragmenting etcd member[https://10.1.0.201:2383]
4. Confirm the space has been released
# etcdctl endpoint status -w=json | jq '.[].Status.dbSize'
20480
20480
20480
5. Disarm the alarm; etcd only accepts writes again once it is cleared
# etcdctl alarm disarm
memberID:6873906650309959482 alarm:NOSPACE
memberID:14378557705461272149 alarm:NOSPACE
memberID:18129117837516069716 alarm:NOSPACE
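The five steps above can be strung together roughly as follows. This is a sketch under assumptions rather than a hardened tool: reclaim_etcd_space and current_revision are names invented here, etcdctl is assumed to already have its endpoints and TLS settings configured (for example through ETCDCTL_* environment variables), and the revision is extracted with grep instead of jq so the script has no extra dependency.

```shell
#!/usr/bin/env bash

# current_revision (hypothetical helper): reads the JSON produced by
# `etcdctl endpoint status -w=json` on stdin and prints the first revision.
current_revision() {
  grep -oE '"revision":[[:space:]]*[0-9]+' | head -n 1 | grep -oE '[0-9]+$'
}

# reclaim_etcd_space (hypothetical helper): compact, defragment, verify, disarm.
reclaim_etcd_space() {
  local rev
  # Step 1: get the current revision.
  rev="$(etcdctl endpoint status -w=json | current_revision)"
  if [ -z "$rev" ]; then
    echo "could not determine the current revision" >&2
    return 1
  fi
  # Step 2: discard key history older than that revision.
  etcdctl compaction "$rev" || return 1
  # Step 3: defragment every member so the freed space is actually released.
  etcdctl defrag --cluster || return 1
  # Step 4: print the status table so the smaller DB SIZE can be confirmed.
  etcdctl endpoint status --cluster -w=table
  # Step 5: clear the NOSPACE alarm so the cluster accepts writes again.
  etcdctl alarm disarm
}
```

Calling reclaim_etcd_space runs the whole sequence once. Defragmentation blocks a member while it runs, so on a busy cluster it may be preferable to defragment one endpoint at a time instead of using --cluster.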
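Lesson learned
Do not casually benchmark a production database. It degrades performance, and an unexpected side effect can turn into an outage.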
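One way to act on this is to run the check only against a scratch etcd instance named explicitly on the command line. The wrapper below is a sketch: check_perf_scratch is a hypothetical name, and it assumes that this etcdctl version accepts the --auto-compact and --auto-defrag flags for check perf (worth confirming with etcdctl check perf --help). Those flags only clean up after the test finishes, so they would not stop the quota from being hit in the middle of a run.

```shell
#!/usr/bin/env bash

# check_perf_scratch (hypothetical helper): benchmark an explicitly named
# endpoint and ask etcdctl to compact and defragment when the test is done.
# Usage: check_perf_scratch ENDPOINT [LOAD]   (LOAD is one of s, m, l, xl)
check_perf_scratch() {
  local endpoint="$1" load="${2:-s}"
  if [ -z "$endpoint" ]; then
    echo "usage: check_perf_scratch ENDPOINT [LOAD]" >&2
    return 2
  fi
  # Reject anything other than the workloads etcdctl documents.
  case "$load" in
    s|m|l|xl) ;;
    *) echo "unknown load: $load (expected s, m, l or xl)" >&2; return 2 ;;
  esac
  # Assumption: --auto-compact and --auto-defrag are supported by this etcdctl.
  etcdctl --endpoints="$endpoint" check perf --load="$load" --auto-compact --auto-defrag
}
```

For example, check_perf_scratch http://127.0.0.1:2379 xl would target a local throwaway instance (the address is only an illustration) rather than whatever cluster the shell environment happens to point at.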