问题
MySQL 没有像 Oracle 中的 AWR 一样,可以获取某个时刻的快照来做性能分析,很难追溯问题时间点发生了什么。针对此问题,可以基于pt-stalk做一个触发式监控,当某个指标达到设定阈值时,执行 pt-stalk 来收集当时的一些数据库运行状态等。
MySQL 虽然有慢查询,但是慢查询 SQL 作为一个结果,并不能是“因”,所以我们需要一个捕捉故障现场的工具。
重要参数介绍
触发条件的参数
- function: 默认为 status,代表监控 SHOW GLOBAL STATUS 的输出; 也可以设置为 processlist,代表监控 show processlist 的输出;
- variable: 默认为 Threads_running,代表 监控参数,根据上述监控输出指定具体的监控项;
- threshold: 默认为 25,代表 监控阈值,监控参数超过阈值,则满足触发条件; 监控参数的值非数字时,需要配合 match 参数一起使用,如 processlist 的 state 列;
- cycles: 默认为 5,表示连续观察到五次满足触发条件时,才触发收集;
- 连接参数:host、password、port、socket。
其他重要参数
- iterations:该参数指定 pt-stalk 在触发收集几次后退出,默认会一直运行。
- run-time:触发收集后,该参数指定收集多长时间的数据,默认 30 秒。
- sleep:该参数指定在触发收集后,sleep 多久后继续监控,默认 300 秒。
- interval:指定状态参数的检查频率,判断是否需要触发收集,默认 1 秒。
- dest:监控数据存放路径,默认为 /var/lib/pt-stalk。
- retention-time :监控数据保留时长,默认 30 天。
- daemonize:以后台服务运行,默认不开启。
- log:后台运行日志,默认为 /var/log/pt-stalk.log。
- collect:触发发生时收集诊断数据,默认开启。
-
- collect-gdb:收集 GDB 堆栈跟踪,需要 gdb 工具。
- collect-strace:收集跟踪数据,需要 strace 工具。
- collect-tcpdump:收集 tcpdump 数据,需要 tcpdump 工具。
演示
场景一
需求:连接数超过某个阈值,触发收集信息
实现:监视SHOW GLOBAL STATUS中的Threads_connected状态值,如果该值超过5,则触发收集主机和MySQL的性能、状态信息。pt-stalk会每隔一秒检查一次状态值,如果连续5次满足触发条件,则开始收集。
1、运行pt-stalk
pt-stalk --function=status --variable=Threads_connected --threshold=5 \
--user=root --password=Qwer1234 --host=127.0.0.1 \
--cycles=5 \
--interval=1 \
--iterations=3 \
--run-time=3 \
--sleep=60 \
--sleep-collect=1 \
--daemonize
Sangfor:DBVM/host-fefcfe935efb ~ # ps -ef|grep pt-stalk
root 2857570 1 0 11:05 pts/0 00:00:00 bash /usr/bin/pt-stalk --function status --variable Threads_connected --threshold 2500 --daemonize --user=root --password=Qwer1234 --host=127.0.0.1
root 2858586 2803110 0 11:05 pts/0 00:00:00 grep --color=auto pt-stalk
2、模拟连接超过5
mysql -uroot -pQwer1234 -h127.0.0.1 -e "
drop database if exists sbtest;
create database sbtest;"
sysbench /usr/share/sysbench/oltp_read_write.lua \
--mysql-socket=/run/sock/mysql.sock --mysql-user=root --mysql-password=Qwer1234 \
--db-ps-mode=disable --report-interval=1 --forced-shutdown=1 \
--mysql-db=sbtest \
--tables=10 --table-size=10000 \
--threads=10 \
prepare
sysbench /usr/share/sysbench/oltp_read_write.lua \
--mysql-socket=/run/sock/mysql.sock --mysql-user=root --mysql-password=Qwer1234 \
--db-ps-mode=disable --report-interval=1 --forced-shutdown=1 \
--mysql-db=sbtest \
--tables=10 --table-size=10000 \
--threads=10 --time=3600 \
run
默认日志存放路径为/var/log/pt-stalk.log
tailf -n 500 /var/log/pt-stalk.log
收集的信息默认存放在/var/lib/pt-stalk中
Sangfor:DBVM/host-fefcfe935efb /var/lib/pt-stalk # ll
total 9164
-rw-r--r-- 1 root root 13264 Mar 31 11:24 2023_03_31_11_23_57-df
-rw-r--r-- 1 root root 121 Mar 31 11:24 2023_03_31_11_23_57-disk-space
-rw-r--r-- 1 root root 20416 Mar 31 11:24 2023_03_31_11_23_57-diskstats
-rw-r--r-- 1 root root 18 Mar 31 11:24 2023_03_31_11_23_57-hostname
-rw-r--r-- 1 root root 11600 Mar 31 11:23 2023_03_31_11_23_57-innodbstatus1
-rw-r--r-- 1 root root 11598 Mar 31 11:24 2023_03_31_11_23_57-innodbstatus2
-rw-r--r-- 1 root root 36144 Mar 31 11:24 2023_03_31_11_23_57-interrupts
-rw-r--r-- 1 root root 20644 Mar 31 11:24 2023_03_31_11_23_57-iostat
-rw-r--r-- 1 root root 1436 Mar 31 11:24 2023_03_31_11_23_57-iostat-overall
-rw-r--r-- 1 root root 511 Mar 31 11:24 2023_03_31_11_23_57-lock-waits
-rw-r--r-- 1 root root 1689 Mar 31 11:24 2023_03_31_11_23_57-log_error
-rw-r--r-- 1 root root 17462 Mar 31 11:24 2023_03_31_11_23_57-lsof
-rw-r--r-- 1 root root 22512 Mar 31 11:24 2023_03_31_11_23_57-meminfo
-rw-r--r-- 1 root root 9115 Mar 31 11:24 2023_03_31_11_23_57-mpstat
-rw-r--r-- 1 root root 647 Mar 31 11:24 2023_03_31_11_23_57-mpstat-overall
-rw-r--r-- 1 root root 1347 Mar 31 11:24 2023_03_31_11_23_57-mutex-status1
-rw-r--r-- 1 root root 1347 Mar 31 11:24 2023_03_31_11_23_57-mutex-status2
-rw-r--r-- 1 root root 7732319 Mar 31 11:24 2023_03_31_11_23_57-mysqladmin
-rw-r--r-- 1 root root 232858 Mar 31 11:24 2023_03_31_11_23_57-netstat
-rw-r--r-- 1 root root 45024 Mar 31 11:24 2023_03_31_11_23_57-netstat_s
-rw-r--r-- 1 root root 5164 Mar 31 11:24 2023_03_31_11_23_57-opentables1
-rw-r--r-- 1 root root 5164 Mar 31 11:24 2023_03_31_11_23_57-opentables2
-rw-r--r-- 1 root root 28388 Mar 31 11:25 2023_03_31_11_23_57-output
-rw-r--r-- 1 root root 18468 Mar 31 11:23 2023_03_31_11_23_57-pmap
-rw-r--r-- 1 root root 68515 Mar 31 11:24 2023_03_31_11_23_57-processlist
-rw-r--r-- 1 root root 52400 Mar 31 11:24 2023_03_31_11_23_57-procstat
-rw-r--r-- 1 root root 44493 Mar 31 11:24 2023_03_31_11_23_57-procvmstat
-rw-r--r-- 1 root root 15596 Mar 31 11:24 2023_03_31_11_23_57-ps
-rw-r--r-- 1 root root 276960 Mar 31 11:24 2023_03_31_11_23_57-slabinfo
-rw-r--r-- 1 root root 6256 Mar 31 11:24 2023_03_31_11_23_57-slave-status
-rw-r--r-- 1 root root 33654 Mar 31 11:24 2023_03_31_11_23_57-sysctl
-rw-r--r-- 1 root root 262637 Mar 31 11:25 2023_03_31_11_23_57-top
-rw-r--r-- 1 root root 146039 Mar 31 11:24 2023_03_31_11_23_57-transactions
-rw-r--r-- 1 root root 384 Mar 31 11:23 2023_03_31_11_23_57-trigger
-rw-r--r-- 1 root root 116080 Mar 31 11:23 2023_03_31_11_23_57-variables
-rw-r--r-- 1 root root 2771 Mar 31 11:24 2023_03_31_11_23_57-vmstat
-rw-r--r-- 1 root root 325 Mar 31 11:24 2023_03_31_11_23_57-vmstat-overall
收集到的信息如上 ,我们可以看一下*processlist,查看此时数据库都有哪些连接
*************************** 1. row ***************************
Id: 2
User: sangforroot
Host: 10.5.54.184:54504
db: NULL
Command: Sleep
Time: 1
State:
Info: NULL
*************************** 2. row ***************************
Id: 3
User: sangforroot
Host: 10.5.54.184:54508
db: NULL
Command: Sleep
Time: 1
State:
Info: NULL
*************************** 3. row ***************************
Id: 4
User: sangforroot
Host: 10.5.54.184:54510
db: NULL
Command: Sleep
Time: 1
State:
Info: NULL
*************************** 4. row ***************************
Id: 214
User: sangforroot
Host: localhost
db: NULL
Command: Binlog Dump
Time: 349267
State: Master has sent all binlog to slave; waiting for more updates
Info: NULL
*************************** 5. row ***************************
Id: 118279
User: root
Host: localhost
db: sbtest
Command: Query
Time: 0
State: Sending to client
Info: SELECT DISTINCT c FROM sbtest4 WHERE id BETWEEN 4978 AND 5077 ORDER BY c
*************************** 6. row ***************************
Id: 118280
User: root
Host: localhost
db: sbtest
Command: Query
Time: 0
State: Creating sort index
Info: SELECT c FROM sbtest5 WHERE id BETWEEN 5032 AND 5131 ORDER BY c
*************************** 7. row ***************************
Id: 118281
User: root
Host: localhost
db: sbtest
Command: Query
Time: 0
State: Sending to client
Info: SELECT DISTINCT c FROM sbtest4 WHERE id BETWEEN 4988 AND 5087 ORDER BY c
场景二
需求:立即收集当前数据库以及主机运行情况
实现:非后台模式下 立即收集 主机和 MySQL 的性能与状态信息 60s,等待 180s 自动退出,无需任何触发条件,主要用于临时保留问题场景信息,便于事后分析。
此脚本比较适合做定制,如CPU使用率达到90%,就执行如下脚本
pt-stalk --no-stalk --run-time=60 --iterations=1 --user=root --password=Qwer1234 --host=127.0.0.1
自定义脚本,定义CPU作为触发条件
function trg_plugin(){
a=$(sar 1 1 | grep -i "Average:"| awk '{print $8}');echo 100 - $a |bc
补充代码逻辑
$a >= 90
执行pt-stalk脚本
打印告警
}
查看运行线程数
Sangfor:DBVM/host-fefcfe935efb /var/lib/pt-stalk # grep "Threads_running" 2023_03_31_12_35_53-mysqladmin|more
| Threads_running | 13
|
| Threads_running | 12
|
| Threads_running | 12
|
| Threads_running | 11
|
| Threads_running | 13
获取锁等待信息
Sangfor:DBVM/host-fefcfe935efb /var/lib/pt-stalk # cat 2023_03_31_12_35_53-lock-waits
*************************** 1. row ***************************
waiting_trx_id: 4543697
waiting_thread: 121158
wait_time: 0
waiting_query: UPDATE sbtest10 SET c='62105750248-40905263373-27066081458-89619250545-39967039915-57767782411-38381465013-10861659840-25930497389-99995825885' WHERE id=4995
waiting_table_lock: `sbtest`.`sbtest10`
blocking_trx_id: 4543694
blocking_thread: 121151
blocking_host:
blocking_port: localhost
idle_in_trx: 0
blocking_query: COMMIT
*************************** 1. row ***************************
waiting_trx_id: 4544867
waiting_thread: 121156
wait_time: 0
waiting_query: DELETE FROM sbtest4 WHERE id=4978
waiting_table_lock: `sbtest`.`sbtest4`
blocking_trx_id: 4544872
blocking_thread: 121159
blocking_host:
blocking_port: localhost
idle_in_trx: 0
blocking_query: COMMIT
*************************** 1. row ***************************
waiting_trx_id: 4549890
waiting_thread: 121159
wait_time: 0
waiting_query: UPDATE sbtest2 SET c='11936131662-14491864212-50210651746-24762888650-82865793955-05779236494-92950621249-85242503791-08032018653-07744788154' WHERE id=5028
waiting_table_lock: `sbtest`.`sbtest2`
blocking_trx_id: 4549889
blocking_thread: 121157
blocking_host:
blocking_port: localhost
idle_in_trx: 0
blocking_query: COMMIT
*************************** 1. row ***************************
waiting_trx_id: 4551639
waiting_thread: 121159
wait_time: 0
waiting_query: DELETE FROM sbtest2 WHERE id=5029
waiting_table_lock: `sbtest`.`sbtest2`
blocking_trx_id: 4551635
blocking_thread: 121154
blocking_host:
blocking_port: localhost
idle_in_trx: 0
blocking_query: COMMIT
补充
function也可以捕捉“show processlist”中的值,比如捕捉state中出现“Waiting for global read lock”来判断是否存在刷新锁
pt-stalk --function=processlist --variable=State --match="Waiting for global read lock" --threshold=5 \
--user=root --password=Qwer1234 --host=127.0.0.1 \
--cycles=5 \
--interval=1 \
--iterations=3 \
--run-time=3 \
--sleep=60 \
--sleep-collect=1 \
--daemonize
查看porcesslist文件
Sangfor:DBVM/host-fefcfe935efb /var/lib/pt-stalk # cat 2023_03_31_13_16_44-processlist
*************************** 14. row ***************************
Id: 122868
User: root
Host: localhost
db: sbtest
Command: Query
Time: 58
State: Waiting for global read lock
Info: UPDATE sbtest1 SET k=k+1 WHERE id=5008
*************************** 15. row ***************************
Id: 122869
User: root
Host: localhost
db: sbtest
Command: Query
Time: 58
State: Waiting for global read lock
Info: UPDATE sbtest5 SET k=k+1 WHERE id=5046
*************************** 16. row ***************************
Id: 122870
User: root
Host: localhost
db: sbtest
Command: Query
Time: 58
State: Waiting for global read lock
Info: UPDATE sbtest4 SET k=k+1 WHERE id=6148
*************************** 17. row ***************************
Id: 122871
User: root
Host: localhost
db: sbtest
Command: Query
Time: 58
State: Waiting for global read lock
Info: UPDATE sbtest9 SET k=k+1 WHERE id=5041
从show processlist中可以看看到,确实存在“Waiting for global read lock”。