JVM崩溃故障

835 阅读2分钟

首先感谢下列链接中提到的关于使用MAT进行详尽的headdump分析的文章。里面有很多提到的细节在以往的工作中没有尝试或者忽略了,推荐新手多多研读。

www.cnblogs.com/trust-freed…

过国庆期间负责的系统出现了进程崩溃,在日志中发现了以下内容,以往OOM的日志都是相对比较简单的,这里出现了一些不常见到的报错内容。虽然内容比较多,仔细梳理还是能看到其实都是由于OOM导致的,是一种连锁反应,包括部署失败,控制台失联,RJVM停止,JobRunShell参数取不到等。

该项目使用的是JDK1.8, weblogic版本是12.2.1.3,在应用程序中配置了很多的JOB,批量执行的时候就会创建许多的object, 一旦GC回收不上来就容易出现JVM资源不足的告警。

<Oct 7, 2021 9:47:23,570 AM CST> <Error> <RJVM> <BEA-000580> <Failed to send JVMMessage from: '-7949835275530680187S:
10.1.30.73:proddomain:appserver1' to: '3659578547004677836S:xx.xx.xx.73:[7001,-1,-1,-1,-1,-1,-1]:proddomain:AdminServe
r' cmd: 'CMD_ONE_WAY', QOS: '101', responseId: '0', invokableId: '0', flags: 'JVMIDs Not Sent, TX Context Not Sent, 0
x0', abbrev offset: '0'. 
 java.io.IOException: Broken pipe
java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
        Truncated. see log file for complete stacktrace
> 
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid25089.hprof ...
Heap dump file created [6479774438 bytes in 120.034 secs]
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="/threaddump1.sh"
#   Executing /bin/sh -c "/threaddump1.sh"...
----count:1-----listsize:0----
begin scan task! Time:Thu Oct 07 10:19:14 CST 2021
<Oct 7, 2021 10:19:14,050 AM CST> <Error> <HTTP> <BEA-101017> <[ServletContext@1831340045[app:bea_wls_deployment_internal module:bea_wls_deployment_internal.war path:null spec-version:3.1]] Root cause of ServletException.
java.lang.NoClassDefFoundError: Could not initialize class weblogic.deploy.service.internal.DeploymentServiceLogger$MessageLoggerInitializer
        at weblogic.deploy.service.internal.DeploymentServiceLogger.logDomainWideSecretMismatchLoggable(DeploymentServiceLogger.java:449)
        at weblogic.deploy.service.internal.transport.http.DeploymentServiceServlet.authenticateRequest(DeploymentServiceServlet.java:895)
        at weblogic.deploy.service.internal.transport.http.DeploymentServiceServlet.doPost(DeploymentServiceServlet.java:210)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
        Truncated. see log file for complete stacktrace
> 
<Oct 7, 2021 10:19:14,065 AM CST> <Warning> <Socket> <BEA-000449> <Closing the socket, as no data read from it on 10.1.30.248:62,236 during the configured idle timeout of 5 seconds.> 
<Oct 7, 2021 10:19:14,066 AM CST> <Warning> <Socket> <BEA-000449> <Closing the socket, as no data read from it on 10.1.30.248:49,042 during the configured idle timeout of 5 seconds.> 
----count:1-----listsize:0----
<Oct 7, 2021 10:19:14,067 AM CST> <Warning> <HarvesterJMX> <BEA-2011102> <An unexpected condition was encountered in harvester WLSBeanTreeHarvester. Logging it here and continuing processing... Context: A problem occurred collecting the value of attribute HealthState.State on instance com.bea:Name=appserver1,Type=ServerRuntime.> 
begin scan task! Time:Thu Oct 07 10:18:37 CST 2021
<Oct 7, 2021 10:19:14,075 AM CST> <Warning> <RMI> <BEA-080006> <Failed to associate the transaction context with the response while marshalling an exception to the client:
 java.rmi.ConnectException: This RJVM has already been shutdown 3659578547004677836S:10.1.30.73:[7001,-1,-1,-1,-1,-1,-1]:proddomain:AdminServer.
java.rmi.ConnectException: This RJVM has already been shutdown 3659578547004677836S:10.1.30.73:[7001,-1,-1,-1,-1,-1,-1]:proddomain:AdminServer
        at weblogic.rjvm.RJVMImpl.connectionCheck(RJVMImpl.java:843)
        at weblogic.rjvm.RJVMImpl.getOutputStreamByName(RJVMImpl.java:506)
        at weblogic.rjvm.RJVMImpl.getResponseStream(RJVMImpl.java:948)
        at weblogic.rjvm.MsgAbbrevInputStream.getMsgAbbrevOutputStream(MsgAbbrevInputStream.java:321)
        at weblogic.rjvm.MsgAbbrevInputStream.getOutboundResponse(MsgAbbrevInputStream.java:605)
        Truncated. see log file for complete stacktrace
> 
[2021-10-07 10:06:22]ERROR org.quartz.core.JobRunShell(line:211) -Job DEFAULT.immediatelyTaskScanerDetail threw an unhandled Exception: 
org.springframework.scheduling.quartz.JobMethodInvocationFailedException: Invocation of method 'execScanTask' on target class [class controller.taskScanner.ImmediatelyTaskScaner] failed; nested exception is java.lang.OutOfMemoryError: Java heap space
        at org.springframework.scheduling.quartz.MethodInvokingJobDetailFactoryBean$MethodInvokingJob.executeInternal(MethodInvokingJobDetailFactoryBean.java:273)
        at org.springframework.scheduling.quartz.QuartzJobBean.execute(QuartzJobBean.java:86)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:529)
Caused by: java.lang.OutOfMemoryError: Java heap space
[2021-10-07 10:18:11]ERROR org.quartz.core.JobRunShell(line:211) -Job DEFAULT.immediatelyTaskScanerDetail threw an unhandled Exception: 
org.springframework.scheduling.quartz.JobMethodInvocationFailedException: Invocation of method 'execScanTask' on target class [class controller.taskScanner.ImmediatelyTaskScaner] failed; nested exception is java.lang.OutOfMemoryError: Java heap space
        at org.springframework.scheduling.quartz.MethodInvokingJobDetailFactoryBean$MethodInvokingJob.executeInternal(MethodInvokingJobDetailFactoryBean.java:273)
        at org.springframework.scheduling.quartz.QuartzJobBean.execute(QuartzJobBean.java:86)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:529)
Caused by: java.lang.OutOfMemoryError: Java heap space
<Oct 7, 2021 10:19:14,086 AM CST> <Error> <Security> <BEA-090975> <The connection between the WebLogic servers cannot be authorized since the nonce value has expired. Verify that clock skew has not occurred between the servers in the domain.> 
(END)

这次的崩溃生成了完整的6G的headdump文件,通过MAT分析后可以得到有两个进程占用了比较多的资源:

图片.png

给出的overview如下:

图片.png

图片.png

通过查看details,两个进程都是org.springframework.beans.factory.support.DefaultListableBeanFactory类在调用java.util.HashMap$Node[]方法,且在运行中该方法占用了较多资源且没有及时释放。

通过查看java.util.HashMap$Node[]调用资源的路径发现:一个是有线程占用了资源没有释放,下面是堆栈信息:(隐去项目细节内容)

<Java Local> weblogic.work.ExecuteThread @ 0x70001cc48 [STUCK] ExecuteThread: '69' for queue: 'weblogic.kernel.Default (self-tuning)' Thread  

另外就是在报告中显示当时进程也创建了如下大量的对象:

图片.png 以下是其中一个线程中调用eo各种类的情况,其中eo.version.process.Handler创建了365391个对象,占用了321M的资源的资源;

eo.version.business.BusinessElement创建了363,309个对象,占用了月465M的资源,加上其他eo类,共计999259个对象,1.9G的资源,以上资源都是在进程崩溃时正在运行的。需要项目组回溯以下程序看以下为什么创建了这么多的对象。

图片.png