线上环境,客户提出来一个紧急问题,是报表数据生产失败的问题,经过排查得知是feign调用超时问题,现在讲本次线上问题排查思路梳理如下:
一.问题还原
A服务 调用B服务 接口,传递参数有 租户id,开始时间,结束时间,B服务会查询租户的数据返回,但是从A服务看,等到60秒后会超时,如下所示:
2024-04-02 10:36:04.456 ERROR 28305 --- [http-nio-5678-exec-4] c.d.x.c.b.e.GlobalExceptionHandler : [ControllerErrorHandler系统错误], errorStack: com.netflix.hystrix.exception.HystrixRuntimeException: CommonFeignClient#getHostListReport(HostReportQueryDTO) timed-out and no fallback available.
at com.netflix.hystrix.AbstractCommand$22.call(AbstractCommand.java:822)
at com.netflix.hystrix.AbstractCommand$22.call(AbstractCommand.java:807)
at rx.internal.operators.OperatorOnErrorResumeNextViaFunction$4.onError(OperatorOnErrorResumeNextViaFunction.java:140)
at rx.internal.operators.OnSubscribeDoOnEach$DoOnEachSubscriber.onError(OnSubscribeDoOnEach.java:87)
at rx.internal.operators.OnSubscribeDoOnEach$DoOnEachSubscriber.onError(OnSubscribeDoOnEach.java:87)
at com.netflix.hystrix.AbstractCommand$DeprecatedOnFallbackHookApplication$1.onError(AbstractCommand.java:1472)
at com.netflix.hystrix.AbstractCommand$FallbackHookApplication$1.onError(AbstractCommand.java:1397)
at rx.internal.operators.OnSubscribeDoOnEach$DoOnEachSubscriber.onError(OnSubscribeDoOnEach.java:87)
at rx.observers.Subscribers$5.onError(Subscribers.java:230)
at rx.internal.operators.OnSubscribeThrow.call(OnSubscribeThrow.java:44)
at rx.internal.operators.OnSubscribeThrow.call(OnSubscribeThrow.java:28)
at rx.Observable.unsafeSubscribe(Observable.java:10327)
at rx.internal.operators.OnSubscribeDefer.call(OnSubscribeDefer.java:51)
at rx.internal.operators.OnSubscribeDefer.call(OnSubscribeDefer.java:35)
at rx.Observable.unsafeSubscribe(Observable.java:10327)
at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:41)
at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:30)
at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:48)
at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:30)
at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:48)
at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:30)
at rx.Observable.unsafeSubscribe(Observable.java:10327)
at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:41)
at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:30)
at rx.Observable.unsafeSubscribe(Observable.java:10327)
at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:41)
at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:30)
at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:48)
at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:30)
at rx.Observable.unsafeSubscribe(Observable.java:10327)
at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:41)
at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:30)
at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:48)
at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:30)
at rx.Observable.unsafeSubscribe(Observable.java:10327)
at rx.internal.operators.OperatorOnErrorResumeNextViaFunction$4.onError(OperatorOnErrorResumeNextViaFunction.java:142)
at rx.internal.operators.OnSubscribeDoOnEach$DoOnEachSubscriber.onError(OnSubscribeDoOnEach.java:87)
at rx.internal.operators.OnSubscribeDoOnEach$DoOnEachSubscriber.onError(OnSubscribeDoOnEach.java:87)
at com.netflix.hystrix.AbstractCommand$HystrixObservableTimeoutOperator$1.run(AbstractCommand.java:1142)
at com.netflix.hystrix.strategy.concurrency.HystrixContextRunnable$1.call(HystrixContextRunnable.java:41)
at com.netflix.hystrix.strategy.concurrency.HystrixContextRunnable$1.call(HystrixContextRunnable.java:37)
at com.alibaba.ttl.TtlCallable.call(TtlCallable.java:65)
at com.netflix.hystrix.strategy.concurrency.HystrixContextRunnable.run(HystrixContextRunnable.java:57)
at com.netflix.hystrix.AbstractCommand$HystrixObservableTimeoutOperator$2.tick(AbstractCommand.java:1159)
at com.netflix.hystrix.util.HystrixTimer$1.run(HystrixTimer.java:99)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException
at com.netflix.hystrix.AbstractCommand.handleTimeoutViaFallback(AbstractCommand.java:997)
at com.netflix.hystrix.AbstractCommand.access$500(AbstractCommand.java:60)
at com.netflix.hystrix.AbstractCommand$12.call(AbstractCommand.java:609)
at com.netflix.hystrix.AbstractCommand$12.call(AbstractCommand.java:601)
at rx.internal.operators.OperatorOnErrorResumeNextViaFunction$4.onError(OperatorOnErrorResumeNextViaFunction.java:140)
... 16 more
经过排查,系统中feign调用的超时配置的确实是60秒,使用trace方法进一步验证A服务的接口超时问题:
[arthas@28305]$ trace com.xx.CommonFeignClient getHostListReport
Press Q or Ctrl+C to abort.
Affect(class count: 2 , method count: 1) cost in 3086 ms, listenerId: 1
`---ts=2024-04-02 10:41:07;thread_name=http-nio-5678-exec-5;id=d6;is_daemon=true;priority=5;TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@30ed9c6c
`---[60008.446404ms] com.sun.proxy.$Proxy164:getHostListReport() [throws Exception]
`---throw:com.netflix.hystrix.exception.HystrixRuntimeException #822 [CommonFeignClient#getHostListReport(HostReportQueryDTO) timed-out and no fallback available.]
`---ts=2024-04-02 10:41:22;thread_name=http-nio-5678-exec-7;id=d8;is_daemon=true;priority=5;TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@30ed9c6c
`---[60005.811154ms] com.sun.proxy.$Proxy164:getHostListReport() [throws Exception]
`---throw:com.netflix.hystrix.exception.HystrixRuntimeException #822 [CommonFeignClient#getHostListReport(HostReportQueryDTO) timed-out and no fallback available.]
`---ts=2024-04-02 10:43:24;thread_name=http-nio-5678-exec-8;id=d9;is_daemon=true;priority=5;TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@30ed9c6c
`---[60006.744468ms] com.sun.proxy.$Proxy164:getHostListReport() [throws Exception]
`---throw:com.netflix.hystrix.exception.HystrixRuntimeException #822 [CommonFeignClient#getHostListReport(HostReportQueryDTO) timed-out and no fallback available.]
`---ts=2024-04-02 10:44:40;thread_name=http-nio-5678-exec-10;id=db;is_daemon=true;priority=5;TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@30ed9c6c
`---[60008.893705ms] com.sun.proxy.$Proxy164:getHostListReport() [throws Exception]
`---throw:com.netflix.hystrix.exception.HystrixRuntimeException #822 [CommonFeignClient#getHostListReport(HostReportQueryDTO) timed-out and no fallback available.]
`---ts=2024-04-02 10:53:06;thread_name=http-nio-5678-exec-3;id=d4;is_daemon=true;priority=5;TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@30ed9c6c
`---[60006.06974ms] com.sun.proxy.$Proxy164:getHostListReport() [throws Exception]
`---throw:com.netflix.hystrix.exception.HystrixRuntimeException #822 [CommonFeignClient#getHostListReport(HostReportQueryDTO) timed-out and no fallback available.]
`---ts=2024-04-02 10:57:26;thread_name=http-nio-5678-exec-6;id=d7;is_daemon=true;priority=5;TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@30ed9c6c
`---[60005.968829ms] com.sun.proxy.$Proxy164:getHostListReport() [throws Exception]
`---throw:com.netflix.hystrix.exception.HystrixRuntimeException #822 [CommonFeignClient#getHostListReport(HostReportQueryDTO) timed-out and no fallback available.]
能看出每次接口超过60秒就会报超时,接下来,就需要排查B服务是否收到请求
2.B服务排查
经过查看日志,B服务这个接口是没有添加任何日志的,导致如果只是简单的查询日志,是看不到任何请求的,只能通过arthas排查,使用trace命令查看接口的请求耗时
`---ts=2024-04-02 10:48:27;thread_name=http-nio-5668-exec-59;id=3385c;is_daemon=true;priority=5;TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@7c2b6087
`---[191015.262468ms] com.dbapp.ahcloud.network.element.service.impl.HostServiceImpl$$EnhancerBySpringCGLIB$$4ebc7a2b:getHostListReport()
`---[191015.224267ms] org.springframework.cglib.proxy.MethodInterceptor:intercept()
`---[191015.188767ms] com.dbapp.ahcloud.network.element.service.impl.HostServiceImpl:getHostListReport()
+---[0.00905ms] com.dbapp.ahcloud.network.element.dto.host.HostReportResDTO:<init>() #2273
+---[0.00713ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setHostStatus() #2276
+---[0.00642ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setAgentStatus() #2277
+---[0.228765ms] com.dbapp.ahcloud.network.element.service.impl.HostServiceImpl:getHostCountWrapper() #2278
+---[10.255409ms] com.dbapp.ahcloud.network.element.dao.HostMapper:selectList() #2280
+---[0.00865ms] com.dbapp.ahcloud.network.element.dto.host.HostReportResDTO:setNewInstallCount() #2283
+---[0.0069ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setStarteDate() #2286
+---[0.006421ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setEndDate() #2287
+---[0.0064ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setHostStatus() #2288
+---[0.00662ms] com.dbapp.ahcloud.network.element.enums.AgentStatusEnum:getStatus() #2289
+---[0.006791ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setAgentStatus() #2289
+---[0.037051ms] com.dbapp.ahcloud.network.element.service.impl.HostServiceImpl:getHostCountWrapper() #2290
+---[9.937732ms] com.dbapp.ahcloud.network.element.dao.HostMapper:selectCount() #2291
+---[0.00788ms] com.dbapp.ahcloud.network.element.dto.host.HostReportResDTO:setTotalCount() #2292
+---[0.016691ms] com.dbapp.ahcloud.network.element.enums.HostStatusEnum:getStatus() #2295
+---[0.00705ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setHostStatus() #2295
+---[0.0067ms] com.dbapp.ahcloud.network.element.enums.AgentStatusEnum:getStatus() #2296
+---[0.00676ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setAgentStatus() #2296
+---[0.048871ms] com.dbapp.ahcloud.network.element.service.impl.HostServiceImpl:getHostCountWrapper() #2297
+---[8.386791ms] com.dbapp.ahcloud.network.element.dao.HostMapper:selectCount() #2298
+---[0.00827ms] com.dbapp.ahcloud.network.element.dto.host.HostReportResDTO:setOnLineCount() #2299
+---[0.00656ms] com.dbapp.ahcloud.network.element.enums.HostStatusEnum:getStatus() #2302
+---[0.00607ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setHostStatus() #2302
+---[0.00549ms] com.dbapp.ahcloud.network.element.enums.AgentStatusEnum:getStatus() #2303
+---[0.00588ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setAgentStatus() #2303
+---[0.056201ms] com.dbapp.ahcloud.network.element.service.impl.HostServiceImpl:getHostCountWrapper() #2304
+---[7.170546ms] com.dbapp.ahcloud.network.element.dao.HostMapper:selectCount() #2305
+---[0.02316ms] com.dbapp.ahcloud.network.element.dto.host.HostReportResDTO:setOutLineCount() #2306
+---[0.00639ms] com.dbapp.ahcloud.network.element.enums.AgentStatusEnum:getStatus() #2309
+---[0.00558ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setAgentStatus() #2309
+---[0.005ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setHostStatus() #2310
+---[0.034811ms] com.dbapp.ahcloud.network.element.service.impl.HostServiceImpl:getHostCountWrapper() #2311
+---[5.88444ms] com.dbapp.ahcloud.network.element.dao.HostMapper:selectCount() #2312
+---[0.00755ms] com.dbapp.ahcloud.network.element.dto.host.HostReportResDTO:setUnloadCount() #2313
+---[0.00567ms] com.dbapp.ahcloud.network.element.dto.host.HostReportQueryDTO:setAgentStatus() #2315
+---[0.03573ms] com.dbapp.ahcloud.network.element.service.impl.HostServiceImpl:getHostCountWrapper() #2316
+---[115.476815ms] com.dbapp.ahcloud.network.element.dao.HostMapper:selectList() #2318
`---[0.01645ms] com.dbapp.ahcloud.network.element.dto.host.HostReportResDTO:setHostList() #2341
从这个里面看到这个接口是有请求进来的,显示这个方法耗时191015.262468ms,也就是说耗时190秒,但是通过这个方法下面的参数,最长的一个也就耗时 115.476815ms(0.115秒),从这里看到这上面的时间相加,也就一两秒,为啥这里会耗时190秒呢?
继续分析
从这个图上看,执行到最后一步之前,一共耗时也就1秒左右,但是执行到最后一步后,却耗时190秒,所以很大概率是最后一步在setList的时候,有问题,所以接着看代码:
这个人前面会查出来几千条数据,然后通过for循环去查询这个租户的数据,本地数据库里面3041条,就是要调用3千多次......而且这些数据是一样的,What the fuck ~~~~ 这还是一个做了十几年java的人写的代码
3.再次验证B服务返回的结果
watch com.xxx.ConsoleHostController getHostListReport "{params,returnObj,throwExp}" -x 3
method=com.xxx.ConsoleHostController.getHostListReport location=AtExit
ts=2024-04-02 11:00:44; [cost=198116.839392ms] result=@ArrayList[
@Object[][
@HostReportQueryDTO[
starteDate=null,
endDate=null,
tenantId=@Long[160890167693314],
vmType=@String[av],
agentStatus=null,
hostStatus=null,
],
],
@MsspResponse[
data=@HostReportResDTO[
totalCount=@Integer[2864],
newInstallCount=@Integer[4],
activeCount=null,
onLineCount=@Integer[836],
outLineCount=@Integer[2028],
unloadCount=@Integer[177],
HostList=@ArrayList[isEmpty=false;size=3041],
],
msg=@String[操作成功],
debugMsg=null,
code=@Integer[0],
],
null,
]
从返回的结果 HostList=@ArrayList[isEmpty=false;size=3041],,也能证明本次返回的数据是3041条,然后不停的循环去调用feign来获取同一条数据........
二.总结
如果线上环境,代码中没有手动打印任何日志的情况,合理使用arthas能进行快速的问题排查