Linux生产环境出现CPU占用过高的分析思路和定位

1,541 阅读5分钟

引言:更多相关请看 JVM+GC解析系列
结合Linux和JDK命令一块分析。

案例

代码:(传入Linux并执行)

import java.util.Random;

public class JavaDemo {

    public static void main(String[] args) throws Exception {
        while (true){
            System.out.println(new Random().nextInt(322222222));
        }
    }

}

案例步骤

1.先用top命令找出CPU占比最高的。下图是java程序

top - 23:41:45 up 17:07,  4 users,  load average: 0.89, 0.27, 0.13
Tasks: 268 total,   3 running, 265 sleeping,   0 stopped,   0 zombie
%Cpu(s):  4.5 us,  7.6 sy,  0.0 ni, 87.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  2027952 total,   105276 free,   854908 used,  1067768 buff/cache
KiB Swap:  2097148 total,  2097148 free,        0 used.   984700 avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND     
 14886 root      20   0 3274164  49988  11920 S  87.7  2.5   0:45.97 java        
 10337 root      20   0  171172   5648   4292 D  11.6  0.3   0:07.74 sshd        
  7356 root      20   0       0      0      0 S   0.7  0.0   0:00.76 kworker/0:2 
    30 root      20   0       0      0      0 R   0.3  0.0   0:04.68 kworker/4:0 
    ...

2.ps -ef或者jps进一步定位,得知是一个怎么样的一个后台程序。

命令:

ps -ef|grep java

效果:

root      14886  10348 81 23:40 pts/0    00:03:33 java JavaDemo
root      14944  10396  0 23:45 pts/1    00:00:00 grep --color=auto java

或:

jps -l

效果:

14886 JavaDemo
14953 sun.tools.jps.Jps
98218 -- process information unavailable

得到Java进程编号是14886

3.定位到具体线程或者代码(哪一行代码)

参数:ps -mp 进程 -o THREAD,tid,time。参数解释:

  -m 显示所有线程
  -p pid进程使用cpu的时间
  -o 该参数后是用户自定义格式

具体参数:

ps -mp 14886 -o THREAD,tid,time

效果:

USER     %CPU PRI SCNT WCHAN  USER SYSTEM    TID     TIME
root     81.0   -    - -         -      -      - 00:06:18
root      0.0  19    - futex_    -      -  14886 00:00:00
root     79.0  19    - -         -      -  14887 00:06:09
root      0.2  19    - futex_    -      -  14888 00:00:01
root      0.2  19    - futex_    -      -  14889 00:00:01
root      0.2  19    - futex_    -      -  14890 00:00:00
root      0.2  19    - futex_    -      -  14891 00:00:00
root      0.2  19    - futex_    -      -  14892 00:00:00
root      0.2  19    - futex_    -      -  14893 00:00:01
root      0.2  19    - futex_    -      -  14894 00:00:01
root      0.2  19    - futex_    -      -  14895 00:00:01
root      0.1  19    - futex_    -      -  14896 00:00:00
root      0.0  19    - futex_    -      -  14897 00:00:00
root      0.0  19    - futex_    -      -  14898 00:00:00
root      0.0  19    - futex_    -      -  14899 00:00:00
root      0.0  19    - futex_    -      -  14900 00:00:00
root      0.0  19    - futex_    -      -  14901 00:00:00
root      0.0  19    - futex_    -      -  14902 00:00:00
root      0.0  19    - futex_    -      -  14903 00:00:00
root      0.0  19    - futex_    -      -  14904 00:00:00
root      0.0  19    - futex_    -      -  14905 00:00:00

可以看到tid为14887的线程编号%CPU占比为79.0。14887就是CPU占用过高的线程tid。

4.将需要的线程ID转换为16进制格式(英文小写格式)

将线程14887转为16进制3A27,英文小写是3a27。 命令:(将有问题的线程ID转为16进制且英文是小写) printf "%x\n" 有问题的线程ID(16进制格式)。如: 如命令:

printf "%x\n" 14887

效果:

3a27

5.jstack 进程ID | grep tid(16进制线程ID小写英文) -A60

命令:

jstack 14886 | grep 3a27 -A60

效果:

"main" #1 prio=5 os_prio=0 tid=0x00007f48b4009000 nid=0x3a27 runnable [0x00007f48bc6f7000]
   java.lang.Thread.State: RUNNABLE
	at java.io.FileOutputStream.writeBytes(Native Method)
	at java.io.FileOutputStream.write(FileOutputStream.java:326)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
	at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
	- locked <0x00000000e10191e8> (a java.io.BufferedOutputStream)
	at java.io.PrintStream.write(PrintStream.java:482)
	- locked <0x00000000e1007508> (a java.io.PrintStream)
	at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
	at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
	at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
	- locked <0x00000000e1019308> (a java.io.OutputStreamWriter)
	at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185)
	at java.io.PrintStream.write(PrintStream.java:527)
	- eliminated <0x00000000e1007508> (a java.io.PrintStream)
	at java.io.PrintStream.print(PrintStream.java:597)
	at java.io.PrintStream.println(PrintStream.java:736)
	- locked <0x00000000e1007508> (a java.io.PrintStream)
	at JavaDemo.main(JavaDemo.java:8)

"VM Thread" os_prio=0 tid=0x00007f48b4080000 nid=0x3a30 runnable 

"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007f48b401e800 nid=0x3a28 runnable 

"GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007f48b4020800 nid=0x3a29 runnable 

"GC task thread#2 (ParallelGC)" os_prio=0 tid=0x00007f48b4022000 nid=0x3a2a runnable 

"GC task thread#3 (ParallelGC)" os_prio=0 tid=0x00007f48b4024000 nid=0x3a2b runnable 

"GC task thread#4 (ParallelGC)" os_prio=0 tid=0x00007f48b4026000 nid=0x3a2c runnable 

"GC task thread#5 (ParallelGC)" os_prio=0 tid=0x00007f48b4027800 nid=0x3a2d runnable 

"GC task thread#6 (ParallelGC)" os_prio=0 tid=0x00007f48b4029800 nid=0x3a2e runnable 

"GC task thread#7 (ParallelGC)" os_prio=0 tid=0x00007f48b402b000 nid=0x3a2f runnable 

"VM Periodic Task Thread" os_prio=0 tid=0x00007f48b40ce000 nid=0x3a39 waiting on condition 

JNI global references: 5

其中有一个段效果:

JavaDemo.main(JavaDemo.java:8)

表示是在代码的JavaDemo的第八行。也就是下图的位置: