本文为稀土掘金技术社区首发签约文章,30 天内禁止转载,30 天后未获授权禁止转载,侵权必究!
本文为 Android App ANR 系列的第三篇,主要分享几个 ANR 的案例,系列文章目录如下
- Android App ANR 系列 1 :深入理解 Android ANR 机制 (已完成,代码示例更新中)
- Android App ANR 系列 2 :ANR 分析套路和关键 Log 介绍
- Android App ANR 系列 3 :ANR 案例分享
ANR(Application Not Responding),应用程序无响应,简单一个定义,却涵盖了很多 Android 系统的设计思想
首先,ANR 属于应用程序的范畴。这不同于 SNR(System Not Respoding),SNR 反映的问题是系统进程(system_server)失去了响应能力,而 ANR 明确将问题圈定在应用程序。SNR 由 Watchdog 机制保证,具体可以查阅 Watchdog 机制以及问题分析; ANR 由消息处理机制保证,Android 在系统层实现了一套精密的机制来发现 ANR,核心原理是消息调度和超时处理
其次,ANR 机制主体实现在系统层。所有与 ANR 相关的消息,都会经过系统进程(system_server)调度,然后派发到应用进程完成对消息的实际处理,同时,系统进程设计了不同的超时限制来跟踪消息的处理。 一旦应用程序处理消息不当,超时限制就起作用了,它收集一些系统状态,譬如 CPU/IO 使用情况、进程函数调用栈,并且报告用户有进程无响应了(ANR 对话框,部分 Rom 不显示 ANR 对话框,而是直接闪退到主界面)
然后,ANR 问题本质是一个性能问题。ANR 机制实际上对应用程序主线程的限制,要求主线程在限定的时间内处理完一些最常见的操作(启动服务、处理广播、处理输入), 如果处理超时,则认为主线程已经失去了响应其他操作的能力。主线程中的耗时操作,譬如密集 CPU 运算、大量 IO、复杂界面布局等,都会降低应用程序的响应能力
最后,部分 ANR 问题是很难分析的。有时候由于系统底层的一些影响,导致消息调度失败,出现问题的场景又难以复现。 这类 ANR 问题往往需要花费大量的时间去了解系统的一些行为,超出了 ANR 机制本身的范畴。有一些 ANR 问题很难调查清楚,因为整个系统不稳定的因素很多,例如 Linux Kernel 本身的 Bug 引起的内存碎片过多、硬件损坏等。这类比较底层的原因引起的 ANR 问题往往无从查起,并且这根本不是应用程序的问题,浪费了应用开发人员很多时间,如果你从事过整个系统的开发和维护工作的话会深有体会。所以我不能保证了解了本章的所有内容后能够解决一切 ANR 问题,如果出现了很疑难的 ANR 问题,我建议最好去和做 Framework、驱动和内核的朋友聊聊,或者,如果问题只是个十万分之一的偶然现象,不影响程序的正常运行,我倒是建议不去理它
-- From duanqz
ANR 常见原因
对于 ANR 的原因,通常要做到 :大胆假设,小心求证 。发现异常的地方提取之后,先假设是这里的问题导致的,然后以这个假设为出发点,看前后的 Log 看看是否能支持自己的假设,如果不能,那么换一个点.
问题出在当前进程
- 死锁
- 主线程调用 thread 的 join()方法、sleep()方法、wait()方法或者等待线程锁的时候
- 主线程阻塞在 nSyncDraw
- 主线程耗时操作,如复杂的 layout,庞大的 for 循环,IO 等
- 主线程被子线程同步锁 block
- 主线程等待子线程超时
- 主线程 Activity 生命周期函数执行超时
- 主线程 Service 生命周期函数执行超时
- 主线程 Broadcast.onReceive 函数执行超时(即使调用了 goAsync )
- 渲染线程耗时
- 耗时的网络访问
- 大量的数据读写
- 数据库操作
- 硬件操作(比如 Camera)
- service binder 的数量达到上限
- 其它线程终止或崩溃导致主线程一直等待
- Dump 内存操作
- 大量 SharedPerference 同时读写
问题出在远端进程或者系统
- 与 SystemServer 进行 Binder 通信,SystemServer 执行耗时
- 方法本身执行耗时导致超时
- SystemServer Binder 锁竞争太多,导致等锁超时
- 等待其他进程返回超时,比如从其他进程的 ContentProvider 中获取数据超时
- Window 错乱导致 Input 超时
- ContentProvider 对应的进程频繁崩溃,也会杀掉当前进程
- 整机低内存
- 整机 CPU 占用高
- 整机 IO 使用率高
- SurfaceFlinger 超时
- 系统冻结功能出现 Bug
- System Server 中 WatchDog 出现 ANR
- 整机触发温控限制频率
ANR 案例分享
ANR 案例:头条 - 死锁
主要操作是频繁的从侧边栏拉出今日头条进行分屏操作,多次操作之后,应用发生了 ANR,这个会导致手机短暂的在分屏栏中的应用界面黑屏并卡死。不过 4-5 秒之后又正常。
应用发生 ANR 的原因是自身主线程被阻塞导致。
"main"prio=5tid=1Blocked
| group="main"sCount=1dsCount=0obj=0x74f9bbe8self=0xe7084400
| sysTid=28210nice=0cgrp=defaultsched=0/0handle=0xe9dcd534
| state=S schedstat=(13454428309928953492) utm=121stm=13core=3HZ=100
| stack=0xff3b6000-0xff3b8000stackSize=8MB
| held mutexes=
at com.ss.android.common.applog.LogReaper.insertCrashLog(SourceFile:98)
- waiting to lock <0x0d3fbd00> (a com.ss.android.common.applog.LogReaper) held by thread34
at com.ss.android.common.applog.AppLog.uncaughtException(SourceFile:1408
at u.aly.n.uncaughtException(SourceFile:34)
at java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:1068)
at java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:1063)
关键信息
waiting to lock <0x0d3fbd00> (a com.ss.android.common.applog.LogReaper) held by thread34
因此,在下面的 ANR 日志总,查找 tid==34
"LogReaper"prio=5tid=34TimedWaiting
| group="main"sCount=1dsCount=0obj=0x12fcaba0self=0xcb226e00
| sysTid=28274nice=0cgrp=defaultsched=0/0handle=0xc9f9b920
| state=S schedstat=(77341565269270880) utm=7stm=0core=5HZ=100
| stack=0xc9e99000-0xc9e9b000stackSize=1038KB
| held mutexes=
at java.lang.Object.wait!(Native method)
- waiting on <0x00fc7065> (a java.util.concurrent.atomic.AtomicInteger)
at java.lang.Object.wait(Object.java:407)
at com.ss.android.action.b.d.a(SourceFile:216)
at com.ss.android.newmedia.b.onLogSessionBatchEvent(SourceFile:468)
at com.ss.android.common.applog.DBHelper.batchSession(SourceFile:616)
- locked <0x0d4ff1c4> (a com.ss.android.common.applog.DBHelper)
at com.ss.android.common.applog.LogReaper.switchSession(SourceFile:175)
at com.ss.android.common.applog.LogReaper.switchSession(SourceFile:153)
at com.ss.android.common.applog.LogReaper.processItem(SourceFile:122)
- locked <0x0d3fbd00> (a com.ss.android.common.applog.LogReaper)
at com.ss.android.common.applog.LogReaper.run(SourceFile:632)
关键信息:
waiting on <0x00fc7065> (a java.util.concurrent.atomic.AtomicInteger)
应用内代码阻塞引起的
ANR 案例:冻结导致 ANR
搜索 am_anr 查看 anr 发生的时间
05-0100:51:39.594 1449 5234I am_anr : [0,2169,com.xxx.weather2,820526660,Input dispatching timed out (Waiting to send key event because the focused window has not finished processing all of the input events that were previously delivered to it. Outbound queue length: 0. Wait queue length: 1.)]
从 ANR 描述可以看到,是当前的 Window 还在处理上一个 Input 事件超时,导致新的事件没有被及时处理,所以发生了 ANR
搜索 ANR in 查看当时的 cpu 信息
05-0100:51:53.974 1449 5234E ActivityManager: ANR in com.xxx.weather2 (com.xxx.weather2/com.xxx.weather.xxxMainActivity)
05-0100:51:53.974 1449 5234E ActivityManager: PID:2169
05-0100:51:53.974 1449 5234E ActivityManager: Reason:Input dispatching timed out (Waiting to send key event because the focused window has not finished processing all of the input events that were previously delivered to it. Outbound queue length: 0. Wait queue length: 1.)
05-01 00:51:53.974 1449 5234 E ActivityManager: Parent: com.xxx.weather2/com.xxx.weather.xxxMainActivity
05-01 00:51:53.974 1449 5234 E ActivityManager: Load: 29.89 / 31.82 / 32.27
05-01 00:51:53.974 1449 5234 E ActivityManager: CPU usage from 4583ms to 12043ms later (2020-05-01 00:51:44.177 to 2020-05-01 00:51:51.637):
05-01 00:51:53.974 1449 5234 E ActivityManager: 18% 1449/system_server: 9.1% user + 9.3% kernel / faults: 7819 minor 1 major
05-01 00:51:53.974 1449 5234 E ActivityManager: 10% 720/surfaceflinger: 6.2% user + 4.4% kernel / faults: 734 minor 26 major
05-01 00:51:53.974 1449 5234 E ActivityManager: 10% 651/android.hardware.audio@5.0-service-mediatek: 9.5% user + 0.8% kernel / faults: 1 minor 4 major
可以看到当时的 cpu 并不繁忙,所以不是系统负载原因导致的 ANR,那么后续分析就要从 Log 中抽取对应的信息,来分析超时的 input 事件发生前后系统在做什么
分析过程如下
- 找到 ANR 发生时候的具体时间:00:51:39.594
- 从 5s 前的 Log 开始看,找到有用的信息:00:51:34.478
- 发现这个时间点并没有什么异常,只是应用没有响应 key_back_down,导致 5s 后报了 anr
- 从 ANR 之后的 Log 中发现,xxxHansManager : unfreeze uid: 10182 package: com.xxx.weather2 reason: Signal 这句 Log,说明 com.xxx.weather2 之前是冻结状态,这里进行了解冻,那么可以大胆猜想,是不是 com.xxx.weather2 被冻结之后,无法响应事件导致的
- 搜索冻结 Log,可以看到在 00:51:04.000 的时候,系统把 com.xxx.weather2 冻结了 xxxHansManager : freeze uid: 10182 package: com.xxx.weather2 reason: LcdOff
- 这个需要让系统看是否是冻结的逻辑问题
由于 LcdOff, 所以启动冻,weather2 被冻结 :com.xxx.weather2 reason: LcdOff
05-01 00:51:03.994 1449 2187 I xxxHansManager : freeze uid: 10245 package: com.tencent.mm reason: LcdOff
05-01 00:51:03.994 1449 2187 D xxxListManagerImpl: com.tencent.mm in autoStart list!
05-01 00:51:03.998 1449 2187 I xxxHansManager : freeze uid: 10246 package: com.xunmeng.pinduoduo reason: LcdOff
05-01 00:51:03.998 1449 2187 D xxxListManagerImpl: com.xunmeng.pinduoduo in autoStart list!
05-01 00:51:04.000 1449 2187 I xxxHansManager : freeze uid: 10182 package: com.xxx.weather2 reason: LcdOff
05-01 00:51:04.002 645 30914 I netd : firewallSetUidRule(4, 10182, 2) <0.09ms>
05-01 00:51:04.005 1449 2187 I xxxHansManager : freeze uid: 10187 package: com.heytap.yoli reason: LcdOff
05-01 00:51:04.005 1449 2187 D xxxListManagerImpl: com.heytap.yoli in autoStart list!
05-01 00:51:04.005 1449 2187 I xxxHansManager : isHansImportantCase uid: 10196 pkg: cn.kuwo.player reason: audiofocus
05-01 00:51:04.007 1449 2187 I xxxHansManager : freeze uid: 10234 package: com.tencent.mobileqq reason: LcdOff
05-01 00:51:04.008 1449 2187 D xxxListManagerImpl: com.tencent.mobileqq in autoStart list!
05-01 00:51:04.014 1449 2187 I xxxHansManager : freeze uid: 10235 package: com.tencent.qqlive reason: LcdOff
weather2 接收到 KEYCODE_BACK 的 ACTION_DOWN
05-01 00:51:04.471 1449 9842 D xxxPhoneWindowManager: interceptKeyBeforeQueueing:KeyEvent { action=ACTION_DOWN, keyCode=KEYCODE_BACK, scanCode=0, metaState=0, flags=0x8, repeatCount=0, eventTime=38812125, downTime=38812125, deviceId=-1, source=0x101, displayId=0 }
05-01 00:51:04.471 1449 9842 I sysui_multi_action: [757,803,799,key_back_down,802,1]
05-01 00:51:04.477 1449 1620 D xxxPhoneWindowManager: interceptKeyBeforeDispatching key: win=Window{1b36ae8 u0 com.xxx.weather2/com.xxx.weather.xxxMainActivity} event = KeyEvent { action=ACTION_DOWN, keyCode=KEYCODE_BACK, scanCode=0, metaState=0, flags=0x8, repeatCount=0, eventTime=38812125, downTime=38812125, deviceId=-1, source=0x101, displayId=0 }
05-01 00:51:04.477 1449 1620 D xxxPhoneWindowManager: interceptKeyBeforeDispatching newEvent keyCode = 4
weather2 接收到 KEYCODE_BACK 的 ACTION_UP
05-01 00:51:34.478 1449 1620 D xxxPhoneWindowManager: interceptKeyBeforeDispatching key: win=Window{1b36ae8 u0 com.xxx.weather2/com.xxx.weather.xxxMainActivity} event = KeyEvent { action=ACTION_UP, keyCode=KEYCODE_BACK, scanCode=0, metaState=0, flags=0x8, repeatCount=0, eventTime=38842126, downTime=38842126, deviceId=-1, source=0x101, displayId=0 }
05-01 00:51:34.478 1449 1620 D xxxPhoneWindowManager: interceptKeyBeforeDispatching newEvent keyCode = 4
KEYCODE_BACK 的 ACTION_UP 5s 没有响应,触发 ANR
05-01 00:51:39.484 1449 1620 I WindowManager: Input event dispatching timed out sending to com.xxx.weather2/com.xxx.weather.xxxMainActivity. Reason: Waiting to send key event because the focused window has not finished processing all of the input events that were previously delivered to it. Outbound queue length: 0. Wait queue length: 1.
05-01 00:51:39.484 2256 5518 I QUALITY-TOTAL: exp: anr
05-01 00:51:39.594 1449 5234 I am_anr : [0,2169,com.xxx.weather2,820526660,Input dispatching timed out (Waiting to send key event because the focused window has not finished processing all of the input events that were previously delivered to it. Outbound queue length: 0. Wait queue length: 1.)]
05-01 00:51:39.593 1449 1620 D ActivityManager: ANR post Runnable for ProcessRecord{ae0c833 2169:com.xxx.weather2/u0a182} to deal with anr happend at 38847248@#@2169
05-01 00:51:39.593 1449 1620 D ActivityManager: ANR caller(2) = com.android.server.am.ActivityManagerService$LocalService.inputDispatchingTimedOut:19872 com.android.server.wm.ActivityRecord.keyDispatchingTimedOut:2641 com.android.server.wm.AppWindowToken.keyDispatchingTimedOut:2007 com.android.server.wm.InputManagerCallback.notifyANR:111 com.android.server.input.InputManagerService.notifyANR:1822 <bottom of call stack> <bottom of call stack> <bottom of call stack>
系统发 Signal 3 给 weather2(SIGNAL_QUIT = 3,给到 Signal Catcher 线程用于输出 Trace)
05-01 00:51:40.868 2169 2202 I oloros.weather: Thread[7,tid=2202,WaitingInMainSignalCatcherLoop,Thread*=0xd9878400,peer=0x138c0250,"Signal Catcher"]: reacting to signal
系统检测到 Signal 给到的 weather2, 所以给 com.xxx.weather2 解冻
05-01 00:51:40.869 1449 9811 I xxxHansManager : unfreeze uid: 10182 package: com.xxx.weather2 reason: Signal
05-01 00:51:40.869 2169 2202 I oloros.weather:
05-01 00:51:40.869 2169 2169 W ViewRootImpl[xxxMainActivity]: Dropping event due to no window focus: KeyEvent { action=ACTION_DOWN, keyCode=KEYCODE_BACK, scanCode=0, metaState=0, flags=0x8, repeatCount=0, eventTime=38812125, downTime=38812125, deviceId=-1, source=0x101, displayId=0 }, hasFocus:true, mStopped:true, mPausedForTransition:false
05-01 00:51:40.869 2169 2169 W ViewRootImpl[xxxMainActivity]: Cancelling event due to no window focus: KeyEvent { action=ACTION_UP, keyCode=KEYCODE_BACK, scanCode=0, metaState=0, flags=0x28, repeatCount=0, eventTime=38847249, downTime=38812125, deviceId=-1, source=0x101, displayId=0 }, hasFocus:true, mStopped:true, mPausedForTransition:false
可以看到这里 weather2 卡了 36391.9ms,也就是从 00:51:04.471(KEYCODE_BACK 发出) 到 00:51:40.869(当前时间) 这段时间
05-01 00:51:40.869 1449 1620 I InputDispatcher: Window '1b36ae8 com.xxx.weather2/com.xxx.weather.xxxMainActivity (server)' spent 36391.9ms processing the last input event: KeyEvent
05-01 00:51:40.869 2169 2169 V ViewRootImpl[xxxMainActivity]: Sending input event to IME: KeyEvent { action=ACTION_UP, keyCode=KEYCODE_BACK, scanCode=0, metaState=0, flags=0x28, repeatCount=0, eventTime=38847249, downTime=38812125, deviceId=-1, source=0x101, displayId=0 }
05-01 00:51:40.870 2169 2169 I Choreographer: Skipped 2212 frames! The application may be doing too much work on its main thread.
ANR 案例:Broadcast 超时导致 ANR
这个案例 Log 缺失,具体就是用户在进行迁移,BroadcastReceiver 的 onReceive 中会起一个线程处理,最终报的是 BroadcastReceiver ANR
04-0717:03:17.174435 1448 1476V WindowManager: Lookingforfocus: Window{efa2a4f u0 正在升级数据库,数据量较大,请耐心等待}, flags=25296898, canReceive=true1
04-0717:03:17.174459 1448 1476V WindowManager: findFocusedWindow: Foundnewfocus @ Window{efa2a4f u0 正在升级数据库,数据量较大,请耐心等待}2
分析如下:
- 日历后台广播 ANR 的原因是 onReceive 执行超时,之前没有看出来是因为 onReceive 里面耗时操作是用的 new Thread 的操作,以为在子线程里面做耗时操作,就不会影响后面广播的执行
- 后面发现 onReceive 里面调用了 PendingResult result = goAsync(); 这句话会在有序广播接收器执行的时候,可以在子线程执行耗时操作,而不会影响 receiver 的生命周期.这个方法非常简单,返回 mPendingResult 并将其设置为 null。
- 如果我们在 onReceive 方法中调用该方法,这意味着广播处理流程被打断了,当 onReceive 方法执行完毕,由于 mPendingResult 为 null,因此并不会马上回调 AMS.finishReceiver 方法。而且由于 goAsync 返回了 PendingResult,因此我们可以任意时刻、任意线程去调用 PendingResult.finish 去回调 AMS。相当于将一个同步回调变成了异步回调。而在这异步回调过程中,我们可以新起线程进行一些耗时的 IO 操作等等。简单来说,goAsync 提供了一种机制,让我们可以在异步线程中处理广播消息,以防止主线程被阻塞。juejin.im/post/5c15fc…
- 这个案例中,虽然在 onReceive 中使用了线程去处理耗时任务,但是由于调用了 goAsync,所以还是会计算超时时间,如果在规定的时间内没有完成,就算是在子线程,也会触发 BroadcastReceiver 的 ANR
可以看 goAsync 这个方法的官方注释:
/**
* This can be called by an application in {@link #onReceive} to allow
* it to keep the broadcast active after returning from that function.
* This does <em>not</em> change the expectation of being relatively
* responsive to the broadcast, but does allow
* the implementation to move work related to it over to another thread
* to avoid glitching the main UI thread due to disk IO.
*
* <p>As a general rule, broadcast receivers are allowed to run for up to 10 seconds
* before they system will consider them non-responsive and ANR the app. Since these usually
* execute on the app's main thread, they are already bound by the ~5 second time limit
* of various operations that can happen there (not to mention just avoiding UI jank), so
* the receive limit is generally not of concern. However, once you use {@code goAsync}, though
* able to be off the main thread, the broadcast execution limit still applies, and that
* includes the time spent between calling this method and ultimately
* {@link PendingResult#finish() PendingResult.finish()}.</p>
*
* <p>If you are taking advantage of this method to have more time to execute, it is useful
* to know that the available time can be longer in certain situations. In particular, if
* the broadcast you are receiving is not a foreground broadcast (that is, the sender has not
* used {@link Intent#FLAG_RECEIVER_FOREGROUND}), then more time is allowed for the receivers
* to run, allowing them to execute for 30 seconds or even a bit more. This is something that
* receivers should rarely take advantage of (long work should be punted to another system
* facility such as {@link android.app.job.JobScheduler}, {@link android.app.Service}, or
* see especially {@link android.support.v4.app.JobIntentService}), but can be useful in
* certain rare cases where it is necessary to do some work as soon as the broadcast is
* delivered. Keep in mind that the work you do here will block further broadcasts until
* it completes, so taking advantage of this at all excessively can be counter-productive
* and cause later events to be received more slowly.</p>
*
* @return Returns a {@link PendingResult} representing the result of
* the active broadcast. The BroadcastRecord itself is no longer active;
* all data and other interaction must go through {@link PendingResult}
* APIs. The {@link PendingResult#finish PendingResult.finish()} method
* must be called once processing of the broadcast is done.
*/
public final PendingResult goAsync(){
PendingResult res = mPendingResult;
mPendingResult =null;
returnres;
}
其用法如下
@Override1
public void onReceive(final Context context, final Intent intent){
finalPendingResult result = goAsync();
Runnable worker =newRunnable() {
@Override
public void run(){
onReceiveAsync(context, intent);
result.finish();
}
};
mAsyncHandler.post(worker);
}
ANR 案例:Launcher - Input ANR
搜索 am_anr 找到 ANR 发生的时间点
- 属于 input dispatch anr
- 属于应用处理 input 事件超时(而不是 no focus window)
19:44:56.815 2515 25056 I am_anr : [0,3365,com.xxx.launcher,819478085,Input dispatching timed out (Waiting to send key event because the focused window has not finished processing all of the input events that were previously delivered to it. Outbound queue length: 0. Wait queue length: 9.)]
搜索 ANR in 找到 cpu 使用情况 1.从 Load : 0.02 / 0.01 / 0.0 来看,整机的负载并不高,大概率是逻辑导致的 ANR. 2.logd 和 SurfaceFlinger 的 cpu 使用略高,可以当做怀疑的对象重点观察.
1-0119:45:07.753 251525056E ActivityManager: ANR in com.xxx.launcher (com.xxx.launcher/.Launcher)
19:45:07.753 251525056E ActivityManager: PID:3365
19:45:07.753 251525056E ActivityManager: Reason: Input dispatching timed out (Waiting to send key event because the focused window hasnotfinished processing all of the input events that were previously delivered to it. Outbound queue length:0. Wait queue length:9.)
19:45:07.753 251525056E ActivityManager: Parent: com.xxx.launcher/.Launcher
19:45:07.753 251525056E ActivityManager: Load:0.02/0.01/0.0
19:45:07.753 251525056E ActivityManager: CPU usage from0ms to10936ms later (2020- 19:44:56.751to2020- 19:45:07.688):
19:45:07.753 251525056E ActivityManager: 97%546/logd:97% user +0.2% kernel / faults:12minor
19:45:07.753 251525056E ActivityManager: 25%956/surfaceflinger:11% user +13% kernel / faults:151minor17major
19:45:07.753 251525056E ActivityManager: 1.4%1664/media.codec:0.9% user +0.5% kernel / faults:28729minor4major
19:45:07.753 251525056E ActivityManager: 7.5%2515/system_server:2.9% user +4.6% kernel / faults:1818minor8major
19:45:07.753 251525056E ActivityManager: 3.2%3365/com.xxx.launcher:2.1% user +1% kernel / faults:3535minor29major
19:45:07.753 251525056E ActivityManager: 2%3620/com.xxx.persist.system:1.2% user +0.7% kernel / faults:870minor
分析 trace.txt 查看是否有有用的信息,首先看 Launcher 的主线程堆栈
"main"prio=5tid=1Native
| group="main"sCount=1dsCount=0flags=1obj=0x72a50cd0self=0x6f7ae10800
| sysTid=3365nice=-10cgrp=defaultsched=0/0handle=0x700123bed8
| state=S schedstat=(37451337214473599138010494263) utm=26967stm=10484core=0HZ=100
|stack=0x7fc0270000-0x7fc0272000stackSize=8192KB
| held mutexes=
kernel: (couldn't read /proc/self/task/3365/stack)
native: #00pc0000000000071a8c /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28)
native: #01pc0000000000075710 /apex/com.android.runtime/lib64/bionic/libc.so (__futex_wait_ex(voidvolatile*,bool,int,bool, timespecconst*)+140)
native: #02pc00000000000d9744 /apex/com.android.runtime/lib64/bionic/libc.so (pthread_cond_wait+60)
native: #03pc00000000002bf5e8 /system/lib64/libhwui.so (android::uirenderer::renderthread::DrawFrameTask::postAndWait()+168)
native: #04pc00000000002bf510 /system/lib64/libhwui.so (android::uirenderer::renderthread::DrawFrameTask::drawFrame()+44)12
at android.graphics.HardwareRenderer.nSyncAndDrawFrame(Native method)
at android.graphics.HardwareRenderer.syncAndDrawFrame(HardwareRenderer.java:422)
at android.view.ThreadedRenderer.draw(ThreadedRenderer.java:671)
at android.view.ViewRootImpl.draw(ViewRootImpl.java:3983)
- locked <0x030694da> (a java.lang.Object)
at android.view.ViewRootImpl.performDraw(ViewRootImpl.java:3782)
at android.view.ViewRootImpl.performTraversals(ViewRootImpl.java:3085)
at android.view.ViewRootImpl.doTraversal(ViewRootImpl.java:1994)
at android.view.ViewRootImpl$TraversalRunnable.run(ViewRootImpl.java:8201)
at android.view.Choreographer$CallbackRecord.run(Choreographer.java:1085)
at android.view.Choreographer.doCallbacks(Choreographer.java:908)
at android.view.Choreographer.doFrame(Choreographer.java:835)
at android.view.Choreographer$FrameHandler.handleMessage(Choreographer.java:1013)
at android.os.Handler.dispatchMessage(Handler.java:107)
at android.os.Looper.loop(Looper.java:238)
at android.app.ActivityThread.main(ActivityThread.java:7798)
at java.lang.reflect.Method.invoke(Native method)
at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:492)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:995)
可以看到这里主线程阻塞在了等待 RenderThread 返回,那么继续查看 RenderThread 的堆栈
"RenderThread"daemon prio=7tid=25Native
| group="main"sCount=1dsCount=0flags=1obj=0x13100b00self=0x6f0c698c00
| sysTid=3616nice=-10cgrp=defaultsched=0/0handle=0x6f072e0d50
| state=S schedstat=(35808915205059629728352628704) utm=29226stm=6582core=3HZ=100
|stack=0x6f071ea000-0x6f071ec000stackSize=991KB
| held mutexes=
kernel: (couldn't read /proc/self/task/3616/stack)
native: #00pc00000000000c49f4 /apex/com.android.runtime/lib64/bionic/libc.so (__ioctl+4)
native: #01pc000000000007d518 /apex/com.android.runtime/lib64/bionic/libc.so (ioctl+132)
native: #02pc0000000000059e58 /system/lib64/libbinder.so (android::IPCThreadState::talkWithDriver(bool)+244)
native: #03pc000000000005ad94 /system/lib64/libbinder.so (android::IPCThreadState::waitForResponse(android::Parcel*,int*)+60
native: #04pc000000000005ab38 /system/lib64/libbinder.so (android::IPCThreadState::transact(int,unsignedint, android::Parcelconst&, android::Parcel*,unsignedint)+180)
native: #05pc000000000004f000 /system/lib64/libbinder.so (android::BpBinder::transact(unsignedint, android::Parcelconst&, android::Parcel*,unsignedint)+228)
native: #06pc000000000007fef8 /system/lib64/libgui.so (android::BpGraphicBufferProducer::dequeueBuffer(int*, android::sp<android::Fence>*,unsignedint,unsignedint,int,unsignedlong,unsignedlong*, android::FrameEventHistoryDelta*)+224)
native: #07pc00000000000b9880 /system/lib64/libgui.so (android::Surface::dequeueBuffer(ANativeWindowBuffer**,int*)+392)
native: #08pc0000000000380540 /system/lib64/libhwui.so (android::uirenderer::renderthread::ReliableSurface::hook_dequeueBuffer(ANativeWindow*, ANativeWindowBuffer**,int*)+104)
native: #09pc000000000000989c /vendor/lib64/egl/eglSubDriverAndroid.so (???)
native: #10pc0000000000009388 /vendor/lib64/egl/eglSubDriverAndroid.so (???)
native: #11pc000000000022cd2c /vendor/lib64/egl/libGLESv2_adreno.so (???)
native: #12pc0000000000212b8c /vendor/lib64/egl/libGLESv2_adreno.so (???)
native: #13pc0000000000020838 /system/lib64/libEGL.so (android::eglQuerySurfaceImpl(void*,void*,int,int*)+248)
native: #14pc00000000002c6794 /system/lib64/libhwui.so (android::uirenderer::renderthread::EglManager::beginFrame(void*)+224)
native: #15pc00000000002d414c /system/lib64/libhwui.so (android::uirenderer::renderthread::CanvasContext::draw()+236)
native: #16pc00000000002d34e8 /system/lib64/libhwui.so (_ZNSt3__110__function6__funcIZN7android10uirenderer12renderthread13DrawFrameTask11postAndWaitEvE3$_0NS_9allocatorIS6_EEFvvEEclEv$c303f2d2360db58ed70a2d0ac7ed911b+380)
native: #17pc00000000002de044 /system/lib64/libhwui.so (android::uirenderer::WorkQueue::process()+228)
native: #18pc00000000002ddd24 /system/lib64/libhwui.so (android::uirenderer::renderthread::RenderThread::threadLoop()+576)
native: #19pc0000000000013654 /system/lib64/libutils.so (android::Thread::_threadLoop(void*)+328)
native: #20pc00000000000da200 /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+36)
native: #21pc00000000000769d4 /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64)
可以看到 RenderThread 是卡在了 binder 通信上,调用的方法是 dequeueBuffer,在等待 dequeueBuffer 返回结果。正常情况下主线程是不等待 dequeueBuffer 的,在 nSyncAndDrawFrame 返回后就继续执行了,这里卡住是因为在等 RenderThread 的上一个 draw 任务完成,也就是说上一个 draw 任务卡在了 dequeueBuffer,导致这一帧的主线程卡住导致超时
对帧渲染流程熟悉的话,应该知道 dequeueBuffer 的对端是 SurfaceFlinger,这时候需要查看 SurfaceFlinger 是否有问题(这里如果有 binder info 就可以直接看到 renderthread 在跟 SurfaceFlinger 的哪个线程进行通信,binder info 中可以看到 binder 通信的细节,比如这个案例我们应该可以看到(由于没有 dump 出来 binder info,所以下面的只是一个例子)
// 956 是 SurfaceFlinger,这里是 SurfaceFlinger 的 incoming binder 信息
proc956:
incoming transaction28350131:0000000000000000from3365:3616to956:1357code2flags10pri0:110r1 node28346208size112:0data0000000000000000
// 3365 是 Launcher,3365:3616 => 956:1356 意思是从 Launcher(3365)的渲染线程(3616)到 SurfaceFlinger(956) 的 Binder:956_4(1357)
proc3365:
outgoing transaction28350131:0000000000000000from3365:3616to956:1357code2flags10pri0:110r1
这里我们搜 dequeueBuffer,来看 RenderThread 是在与 SurfaceFlinger 的那个 Bidner 线程通信,发现是 Binder:956_4
对应的 sysTid=1357
"Binder:956_4"sysTid=1357
#00pc0000000000071a8c /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28) (BuildId:340eb42dd36c6541aec54c306eb4d0ee)
#01pc0000000000075710 /apex/com.android.runtime/lib64/bionic/libc.so (__futex_wait_ex(voidvolatile*,bool,int,bool, timespecconst*)+140) (BuildId:340eb42dd36c6541aec54c306eb4d0ee)
#02pc00000000000d97e8 /apex/com.android.runtime/lib64/bionic/libc.so (pthread_cond_timedwait+120) (BuildId:340eb42dd36c6541aec54c306eb4d0ee)
#03pc0000000000072df0 /system/lib64/libc++.so (_ZNSt3__118condition_variable15__do_timed_waitERNS_11unique_lockINS_5mutexEEENS_6chrono10time_pointINS5_12system_clockENS5_8durationIxNS_5ratioILl1ELl1000000000EEEEEEE+108) (BuildId:5aad1075509f6e517eb78db32da8fbf6)
#04pc000000000006f9d8 /system/lib64/libgui.so (android::BufferQueueProducer::waitForFreeSlotThenRelock(android::BufferQueueProducer::FreeSlotCaller, std::__1::unique_lock<std::__1::mutex>&,int*)const+788) (BuildId: e017958f19275814d1f2d55ce8b10bfa)
#05pc000000000006fd8c /system/lib64/libgui.so (android::BufferQueueProducer::dequeueBuffer(int*, android::sp<android::Fence>*,unsignedint,unsignedint,int,unsignedlong,unsignedlong*, android::FrameEventHistoryDelta*)+776) (BuildId: e017958f19275814d1f2d55ce8b10bfa)
#06pc000000000007f0a8 /system/lib64/libgui.so (android::BnGraphicBufferProducer::onTransact(unsignedint, android::Parcelconst&, android::Parcel*,unsignedint)+3580) (BuildId: e017958f19275814d1f2d55ce8b10bfa)
#07pc000000000004d67c /system/lib64/libbinder.so (android::BBinder::transact(unsignedint, android::Parcelconst&, android::Parcel*,unsignedint)+136) (BuildId: bb7ca5d12323a310f26473506cf070ed)
#08pc000000000005a55c /system/lib64/libbinder.so (android::IPCThreadState::executeCommand(int)+1008) (BuildId: bb7ca5d12323a310f26473506cf070ed)
#09pc000000000005a0b8 /system/lib64/libbinder.so (android::IPCThreadState::getAndExecuteCommand()+156) (BuildId: bb7ca5d12323a310f26473506cf070ed)
#10pc000000000005a898 /system/lib64/libbinder.so (android::IPCThreadState::joinThreadPool(bool)+220) (BuildId: bb7ca5d12323a310f26473506cf070ed)
#11pc000000000008093c /system/lib64/libbinder.so (android::PoolThread::threadLoop()+24) (BuildId: bb7ca5d12323a310f26473506cf070ed)
#12pc0000000000013654 /system/lib64/libutils.so (android::Thread::_threadLoop(void*)+328) (BuildId:594f10db565bb0b9cf0223c7a1990ce5)
#13pc00000000000da200 /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+36) (BuildId:340eb42dd36c6541aec54c306eb4d0ee)
#14pc00000000000769d4 /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId:340eb42dd36c6541aec54c306eb4d0ee)
可以看到这里是卡在了 waitForFreeSlotThenRelock,其代码逻辑在 frameworks/native/libs/gui/BufferQueueProducer.cpp 中, 感兴趣的可以自己看一下
这里卡在 waitForFreeSlotThenRelock 的意思就是,没有足够的 Buffer,因为每个 Layer 对应的 Buffer 的个数是一定的(2 个或者 3 个或者 4 个,一般是 3 个或者 4 个),如果 4 个 Buffer 都在使用过程中,那么应用调用 dequeueBuffer 去申请 free 的 Buffer 是不会成功的,需要进行等待,这里就是在等待 free 的 buffer 超时,导致了应用的 ANR
Log 里面的 SurfaceFlinger、hwui 的部分 Log 也频繁打印来提示问题点
// SurfaceFlinger 的 Log 提示可用 buffer 数量不足
19:44:42.371 956 956 E BufferQueueConsumer: [com.xxx.launcher/com.xxx.launcher.Launcher#0] acquireBuffer: max acquired buffer count reached: 2 (max 1)
19:44:42.371 956 956 E BufferLayerConsumer: [com.xxx.launcher/com.xxx.launcher.Launcher#0] updateTexImage: acquire failed: Function not implemented (-38)
// 提示是 dequeueBuffer 失败
19:45:45.127 3365 3616 W OpenGLRenderer: dequeueBuffer failed, error = -110; switching to fallback
19:45:45.138 3365 3616 I OpenGLRenderer: Davey! duration=4015ms; Flags=0, IntendedVsync=21148991192817, Vsync=21148991192817, OldestInputEvent=9223372036854775807, NewestInputEvent=0, HandleInputStart=21148991206202, AnimationStart=21148991249587, PerformTraversalsStart=21148991252765, DrawStart=21148991798858, SyncQueued=21148991974796, SyncStart=21152137782451, IssueDrawCommandsStart=21152138724534, SwapBuffers=21156150646824, FrameCompleted=21156152280366, DequeueBufferDuration=0, QueueBufferDuration=0,
// MessageQueue 的信息标识是哪个 Message block
19:45:45.142 3365 3365 E ANR_LOG : >>> msg's executing time is too long
19:45:45.142 3365 3365 E ANR_LOG : Blocked msg = { when=-4s2ms what=0 target=android.view.Choreographer$FrameHandler callback=android.view.Choreographer$FrameDisplayEventReceiver } , cost = 4002 ms
19:45:45.142 3365 3365 E ANR_LOG : >>>Current msg List is:
19:45:45.142 3365 3365 E ANR_LOG : Current msg <1> = { when=-4s5ms what=13 target=android.view.ViewRootImpl$ViewRootHandler }
19:45:45.142 3365 3365 E ANR_LOG : Current msg <2> = { when=-4s1ms what=0 target=android.view.ViewRootImpl$ViewRootHandler callback=com.android.launcher3.util.ViewOnDrawExecutor }
19:45:45.143 3365 3365 E ANR_LOG : Current msg <3> = { when=-4s1ms what=0 target=android.view.ViewRootImpl$ViewRootHandler callback=android.widget.Editor$1 }
19:45:45.143 3365 3365 E ANR_LOG : Current msg <4> = { when=-3s23ms what=0 target=android.view.ViewRootImpl$ViewRootHandler callback=android.widget.ViewFlipper$2 }
19:45:45.143 3365 3365 E ANR_LOG : Current msg <5> = { when=-1s502ms what=6 target=android.view.inputmethod.InputMethodManager$H arg1=74627 obj=android.view.inputmethod.InputMethodManager$PendingEvent@561bec9 }
19:45:45.143 3365 3365 E ANR_LOG : Current msg <6> = { when=+14s859ms what=0 target=android.os.Handler callback=xxxUIEngineProguard.i.b$b }
19:45:45.143 3365 3365 E ANR_LOG : Current msg <7> = { when=+24d20h30m18s414ms what=0 target=android.view.ViewRootImpl$ViewRootHandler callback=android.widget.ViewFlipper$2 }
19:45:45.143 3365 3365 E ANR_LOG : >>>CURRENT MSG DUMP OVER<<<
这种问题一般是 SurfaceFlinger 的问题,可以转给对应的负责人继续分析
异常 Log 案例
主线程进行数据库读写
"main" prio=5 tid=1 Native
held mutexes=
kernel: (couldn't read /proc/self/task/11003/stack)
native: #00 pc 000492a4 /system/lib/libc.so (nanosleep+12)
native: #01 pc 0002dc21 /system/lib/libc.so (usleep+52)
native: #02 pc 00009cab /system/lib/libsqlite.so (???)
native: #03 pc 00011119 /system/lib/libsqlite.so (???)
native: #04 pc 00016455 /system/lib/libsqlite.so (???)
native: #16 pc 0000fa29 /system/lib/libsqlite.so (???)
native: #17 pc 0000fad7 /system/lib/libsqlite.so (sqlite3_prepare16_v2+14)
native: #18 pc 0007f671 /system/lib/libandroid_runtime.so (???)
native: #19 pc 002b4721 /system/framework/arm/boot-framework.oat (Java_android_database_sqlite_SQLiteConnection_nativePrepareStatement__JLjava_lang_String_2+116)
at android.database.sqlite.SQLiteConnection.setWalModeFromConfiguration(SQLiteConnection.java:294)
at android.database.sqlite.SQLiteConnection.open(SQLiteConnection.java:215)
at android.database.sqlite.SQLiteConnection.open(SQLiteConnection.java:193)
at android.database.sqlite.SQLiteConnectionPool.openConnectionLocked(SQLiteConnectionPool.java:463)
at android.database.sqlite.SQLiteConnectionPool.open(SQLiteConnectionPool.java:185)
at android.database.sqlite.SQLiteConnectionPool.open(SQLiteConnectionPool.java:177)
at android.database.sqlite.SQLiteDatabase.openInner(SQLiteDatabase.java:808)
locked <0x0db193bf> (a java.lang.Object)
at android.database.sqlite.SQLiteDatabase.open(SQLiteDatabase.java:793)
at android.database.sqlite.SQLiteDatabase.openDatabase(SQLiteDatabase.java:696)
at android.app.ContextImpl.openOrCreateDatabase(ContextImpl.java:690)
at android.content.ContextWrapper.openOrCreateDatabase(ContextWrapper.java:299)
at android.database.sqlite.SQLiteOpenHelper.getDatabaseLocked(SQLiteOpenHelper.java:223)
at android.database.sqlite.SQLiteOpenHelper.getWritableDatabase(SQLiteOpenHelper.java:163)
locked <0x045a4a8c> (a com.xxxx.video.common.data.DataBaseHelper)
at com.xxxx.video.common.data.DataBaseORM.<init>(DataBaseORM.java:46)
at com.xxxx.video.common.data.DataBaseORM.getInstance(DataBaseORM.java:53)
locked <0x017095d5> (a java.lang.Class<com.xxxx.video.common.data.DataBaseORM>)
Binder 数据量过大
07-2104:43:21.573 1000 148812756E Binder : Unreasonably large binder replybuffer: on android.content.pm.BaseParceledListSlice$1@770c74f calling1size388568(data:1,32,7274595)
07-2104:43:21.573 1000 148812756E Binder : android.util.Log$TerribleFailure: Unreasonably large binder replybuffer: on android.content.pm.BaseParceledListSlice$1@770c74f calling1size388568(data:1,32,7274595)
07-2104:43:21.607 1000 1488 2951E Binder : Unreasonably large binder replybuffer: on android.content.pm.BaseParceledListSlice$1@770c74f calling1size211848(data:1,23,7274595)
07-2104:43:21.607 1000 1488 2951E Binder : android.util.Log$TerribleFailure: Unreasonably large binder replybuffer: on android.content.pm.BaseParceledListSlice$1@770c74f calling1size211848(data:1,23,7274595)
07-2104:43:21.662 1000 1488 6258E Binder : Unreasonably large binder replybuffer: on android.content.pm.BaseParceledListSlice$1@770c74f calling1size259300(data:1,33,7274595)
Binder 通信失败
07-2106:04:35.580<6>[32837.690321] binder:1698:2362transaction failed29189/-3,size100-0line3042
07-2106:04:35.594<6>[32837.704042] binder:1765:4071transaction failed29189/-3,size76-0line3042
07-2106:04:35.899<6>[32838.009132] binder:1765:4067transaction failed29189/-3,size224-8line3042
07-2106:04:36.018<6>[32838.128903] binder:1765:2397transaction failed29189/-22,size348-0line2916
ANR 相关资料分享
- 西瓜视频稳定性治理体系建设一:Tailor 原理及实践
- 西瓜视频稳定性治理体系建设二:Raphael 原理及实践
- 西瓜视频稳定性治理体系建设三:Sliver 原理及实践
- 西瓜卡顿 & ANR 优化治理及监控体系建设
- 今日头条 ANR 优化实践系列 - 设计原理及影响因素
- 今日头条 ANR 优化实践系列 - 监控工具与分析思路
- 今日头条 ANR 优化实践系列分享 - 实例剖析集锦
- 今日头条 ANR 优化实践系列 - Barrier 导致主线程假死
- 今日头条 ANR 优化实践系列 - 告别 SharedPreference 等待
- Android ANR|原理解析及常见案例