Monitoring High CPU Load in Java Processes and Automatically Dumping Thread Stacks


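Preface

Most of us have run into a production Java service whose CPU usage keeps spiking until the service becomes unavailable or misbehaves. The scene usually cannot be preserved for long, or the spike lasts only a moment, so the conditions at that time are hard to reproduce. How should we handle this kind of problem?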


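I. Install and configure Alibaba's diagnostic tool

1. Why not just use the JDK's bundled tools such as jstack? Anyone who has used them in depth knows the diagnostic process is tedious. Above all, they do not rank thread stacks by CPU usage, so there is no built-in way to capture, say, the five busiest threads. Alibaba's well-known Arthas wraps this up neatly, offers a rich feature set, and works out of the box, so we don't have to build it ourselves.
2. How do you install and use it? See the official site; I won't go into detail here. https://arthas.aliyun.com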
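
To make the "tedious" part concrete, here is a rough sketch of the manual JDK-only workflow: find the busiest thread with top, convert its thread ID to the hex "nid" form, then search the jstack output for it. The function names are hypothetical, and the sketch assumes procps top, a default top column layout, and a JDK whose jstack prints nid in hex (as JDK 8 does; adjust the pattern if yours prints it differently).

```shell
#!/bin/bash
# Sketch only: helper names are made up for illustration.

# Convert a decimal Linux thread ID to the hex "nid" form seen in jstack output.
to_hex_nid() {
  printf '0x%x\n' "$1"
}

# Print the stack of the busiest thread of a Java process.
# Assumes top (procps), jstack, awk and grep are on PATH.
dump_busiest_thread() {
  local pid="$1"
  local tid
  # -H lists threads; -b -n 1 prints one batch snapshot (sorted by %CPU by default).
  # The first row whose first field is numeric is taken as the busiest thread.
  tid=$(top -H -b -n 1 -p "$pid" | awk '$1 ~ /^[0-9]+$/ {print $1; exit}')
  [ -n "$tid" ] || return 1
  # Show the matching thread header plus the 20 lines that follow it.
  jstack "$pid" | grep -A 20 "nid=$(to_hex_nid "$tid") "
}
```

Calling `dump_busiest_thread <java pid>` covers only one thread at one instant; repeating it for the top N threads is exactly the chore that Arthas's `thread -n` removes.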

II. Develop the automated monitoring script

Only three Linux implementations are provided here; pick whichever suits your needs.

Option 1

Uses the as.sh script shipped with Arthas to monitor the process and dump thread stacks automatically (MODE=0 in the script).

Option 2

Uses the Arthas HTTP API together with the Linux jq command to monitor the process and dump thread stacks automatically (MODE=2 in the script).
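
As a minimal sketch of the HTTP API call, the request is a JSON body with an "action" and a "command", posted to the /api endpoint. The function names here are hypothetical; the URL, port 8563, and curl timeouts are copied from the monitoring script, and the sketch assumes Arthas is already attached to the target JVM so that the endpoint is listening.

```shell
#!/bin/bash
# Sketch only: function names are made up for illustration.

# Endpoint taken from the monitoring script; change it if Arthas listens elsewhere.
ARTHAS_API_URL="http://localhost:8563/api"

# Build the JSON request body for "thread -n <N>".
build_thread_payload() {
  local top_n="$1"
  printf '{"action":"exec","command":"thread -n %d"}' "$top_n"
}

# POST the request and print the raw JSON response.
# Not called here: it needs a JVM with Arthas already attached.
query_busy_threads() {
  curl -sS -X POST --max-time 14 --connect-timeout 5 \
    "$ARTHAS_API_URL" -d "$(build_thread_payload "$1")"
}
```

Running `query_busy_threads 2` and reading the raw response once by hand is a useful way to confirm the field names before writing a jq filter against them.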

Option 3

Uses the Linux expect command to monitor the process and dump thread stacks automatically (MODE=1 in the script).

Create the mydomains-CPU-monitor.sh script

#!/bin/bash
# work mode: 0 - use as.sh ; 1 - use expect ; 2 - use HTTP API
MODE=0
# java process monitor log path
JAVA_MONITOR_LOG_LOCATION="/var/log/mydomains-monitor.log"
# arthas thread dump log path
THREAD_DUMP_LOG_LOCATION="/var/log/arthas-thread-dump.log"
# the port the java process LISTENs on
PORT=9916
# the id of the java process LISTENing on that port (clients connected to the port are ignored)
JAVA_PID=$(lsof -t -iTCP:$PORT -sTCP:LISTEN | head -n 1)
# java cpu usage threshold (percent) ; 0.0 makes every check trigger a dump , so raise it for real use
THRESHOLD=0.0
# the arthas-thread-stack.exp path ; all of these paths are filled in manually instead of using the "find" cmd
EXP_LOCATION="/root/dk-java-progress-cpu-usage-monitor/arthas-thread-stack.exp"
# the path of arthas's as.sh , filled in manually as above
AS_SH_LOCATION="/root/arthas/arthas/as.sh"
# the batch.as path , filled in manually as above
BATCH_LOCATION="/root/dk-java-progress-cpu-usage-monitor/batch.as"
# pid of the current mydomains-CPU-monitor.sh
SHELL_PID=$$
# pid of any mydomains-CPU-monitor.sh that is already running (excluding the current one)
MYDOMAINS_MONITOR_PID=$(ps -ef |grep mydomains-CPU-monitor.sh |grep -v grep |grep -v $SHELL_PID |awk '{print $2}')
# curl parameters
# maximum time allowed for the transfer (unit: s)
MAX_TIME=14
# maximum time allowed for the connection (unit: s)
CONNECT_TIMEOUT=5
# append a timestamped message to the monitor log
function printLog() {
  echo "$(date '+%Y-%m-%d %H:%M:%S') $1" >> "$JAVA_MONITOR_LOG_LOCATION" 2>&1
}
# HTTP API & Linux jq
function httpAPI() {
	curl -Ss -X POST --max-time $MAX_TIME --connect-timeout $CONNECT_TIMEOUT  http://localhost:8563/api -d '{ "action":"exec","command":"thread -n 2" }' | jq -r '.body.results[0].busyThreads[]|"name=\"\(.name)\" Id=\(.id) cpuUsage=\(.cpu)% deltaTime=\(.deltaTime)ms time=\(.time)ms state=\(.state)\n" 
	+ (
		if (.stackTrace | type) == "array" and (.stackTrace | length) > 0 then 
			(.stackTrace | map("    at \(.className).\(.methodName)(\(.fileName):\(.lineNumber))") | join("\n"))
		else 
			""
		end
      ) + "\n"'  >> $THREAD_DUMP_LOG_LOCATION 2>&1

}

echo -e "\n" >> $JAVA_MONITOR_LOG_LOCATION  2>&1
if [ -n "$MYDOMAINS_MONITOR_PID" ]; then
  printLog "[mydomains-CPU-monitor] mydomains-CPU-monitor.sh is already running -- $MYDOMAINS_MONITOR_PID , cannot start it again !"
  exit 0
fi
if [ -z "$JAVA_PID" ]; then
  printLog "[mydomains-CPU-monitor] cannot find the process id for port -- $PORT"
  exit 0
fi
printLog "[mydomains-CPU-monitor] started mydomains-CPU-monitor.sh , its process id is $$"

while true; do
  echo -e "\n" >> $JAVA_MONITOR_LOG_LOCATION 2>&1
  # query the java process cpu usage
  CPU_USAGE=$(ps -o pcpu= -p $JAVA_PID)

  # if the java process cpu usage is >= THRESHOLD , dump the threads' stack information according to the work mode
  if awk "BEGIN {exit !($CPU_USAGE >= $THRESHOLD)}"; then
    printLog "[mydomains-CPU-monitor] this mydomains instance cpu usage -- $CPU_USAGE exceeds the given threshold -- $THRESHOLD !"
    if [ "$MODE" -eq 0 ]; then
      if [[ -z $AS_SH_LOCATION || -z $BATCH_LOCATION ]]; then
        printLog "[mydomains-CPU-monitor] mode 0 needs the as.sh and batch.as paths , but one of them is blank ! please check it !"
      else
        printLog "[mydomains-CPU-monitor] the mode is 0 , start to execute batch.as ... "
        $AS_SH_LOCATION -f $BATCH_LOCATION $JAVA_PID >> $THREAD_DUMP_LOG_LOCATION 2>&1
        printLog "[mydomains-CPU-monitor] the mode is 0 , finished executing batch.as !"
      fi
    elif [ "$MODE" -eq 1 ]; then
      threadStackPID=$(ps -ef | grep 'arthas-thread-stack.exp' | grep -v grep | awk '{print $2}')
      # quoted so the test still works when ps returns several pids
      if [ -z "$threadStackPID" ]; then
        printLog "[mydomains-CPU-monitor] the mode is 1 , start to execute arthas-thread-stack.exp ... "
        expect $EXP_LOCATION $JAVA_PID
        printLog "[mydomains-CPU-monitor] the mode is 1 , finished executing arthas-thread-stack.exp !"
      else
        printLog "[mydomains-CPU-monitor] arthas-thread-stack.exp is still running ! it cannot be started again ! "
      fi
    elif [ "$MODE" -eq 2 ]; then
      printLog "[mydomains-CPU-monitor] the mode is 2 , start to curl the arthas http api ... "
      httpAPI
      printLog "[mydomains-CPU-monitor] the mode is 2 , finished the curl call !"
    else
      printLog "[mydomains-CPU-monitor] unsupported mode -- $MODE , please check it !"
    fi
  # else
  #   printLog "[mydomains-CPU-monitor] this mydomains instance cpu usage is lower than the given threshold !"
  fi
  # dump thread stack info interval , unit -s
  sleep 15
done
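
One thing worth knowing about the check in the loop: on Linux, procps documents `ps -o pcpu` as CPU time divided by how long the process has been running, so it is a lifetime average and can stay low during a short spike. The sketch below shows a possible alternative that samples top over one second, plus a threshold comparison that passes the values to awk as variables. The function names are hypothetical, and the sampler assumes procps top with its default column layout (%CPU in the ninth field).

```shell
#!/bin/bash
# Sketch only: function names are made up for illustration.

# Succeed (exit status 0) when cpu >= limit, comparing as floating-point numbers.
# Passing the values with -v keeps an empty or odd value out of the awk program text;
# an empty value is treated as 0.
cpu_over_threshold() {
  awk -v cpu="$1" -v limit="$2" 'BEGIN { exit !(cpu + 0 >= limit + 0) }'
}

# Print the CPU usage of a process measured over roughly one second.
# top prints two snapshots; the second one reflects the 1 s interval,
# so the last matching row is the value we keep.
sample_cpu() {
  local pid="$1"
  top -b -n 2 -d 1 -p "$pid" | awk -v pid="$pid" '$1 == pid { cpu = $9 } END { print cpu }'
}
```

In the loop this could be used as `if cpu_over_threshold "$(sample_cpu "$JAVA_PID")" "$THRESHOLD"; then ... fi`. Note that top reports a multi-threaded process as a percentage of one core, so values above 100 are normal.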

Create the arthas-thread-stack.exp script

#!/usr/bin/expect -f
# max time (unit: s) that each expect command waits ; -1 waits forever , so the timeout branches below only fire if this is set to a positive value
set timeout -1

# this arthas-boot.jar absolute path
set arthas_boot_jar /root/arthas/arthas/arthas-boot.jar

# expect outer input param value
set target_pid [lindex $argv 0]

# cpu usage top n thread
set top_n 8

# record script execution logs absolute path
set output_file /var/log/arthas-thread-dump.log

# write logs
proc writeLog {arg1 arg2} {
	log_file $arg1
	set now_time [clock format [clock seconds] -format "%Y-%m-%d %H:%M:%S"]
	send_log $now_time$arg2
	log_file
}

# start a new attach process

writeLog $output_file " start to attach arthas-boot  ------------------------------------ \n"

spawn java -jar "$arthas_boot_jar" "$target_pid"

# wait for the arthas prompt , then start logging and request the busiest threads
expect {
    -re {\[arthas@.*\]} {
        log_file $output_file
        send "thread -n $top_n\r"
    }
    timeout {
        writeLog $output_file " cannot attach to the arthas-boot console !\n"
        exit 1
    }
}

# wait for the thread -n command output to finish (the prompt returns)
expect {
    -re {\[arthas@.*\]} {
        send "exit\r"
        send "\n"
    }
    timeout {
        writeLog $output_file " cannot obtain the thread -n command result !\n"
        exit 1
    }
}

writeLog $output_file " finished this capture task !\n"

expect eof

Create the batch.as script

thread -n 8

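Summary

Consider adding log output for some of the key variables in the monitoring script, so that failures are easier to troubleshoot. For example, when the dump is invoked through as.sh, part of that script's logic uses the telnet command to check whether Arthas is listening; if telnet is not installed, monitoring fails!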
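
One way to catch the telnet problem early is to check for required commands before entering the monitoring loop. This is a minimal sketch with hypothetical function names. The per-mode command lists are my reading of what the scripts in this article call, not an official Arthas requirement list.

```shell
#!/bin/bash
# Sketch only: function names and per-mode lists are assumptions for illustration.

# Print the commands each work mode relies on (0 - as.sh ; 1 - expect ; 2 - HTTP API).
required_commands_for_mode() {
  case "$1" in
    0) echo "lsof awk telnet" ;;
    1) echo "lsof awk expect java" ;;
    2) echo "lsof awk curl jq" ;;
    *) echo "" ;;
  esac
}

# Print, space-separated, whichever of the given commands are not on PATH.
missing_commands() {
  local cmd missing=""
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || missing="$missing $cmd"
  done
  echo "${missing# }"
}
```

Near the top of the monitoring script this could be used as `missing=$(missing_commands $(required_commands_for_mode "$MODE"))`, followed by a log line and an early exit when `$missing` is not empty.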