Android ANR深入分析

1,868 阅读11分钟

一:说明

本文针对相对比较复杂的anr问题。

二:ANR分类说明

image.png 上图为笔者从android9的代码中梳理出的所有anr。

所有anr都与四大组件关联在一起,整体上可分为两类

  1. 与用户直接交互引起的anr,触发源头为input。
  2. 后台组件操作超时引起的anr。触发源头又分为两类:
    • 第一类是服务和广播,都是直接在ams的线程中触发,
    • 第二类是contentprovider,触发源头为发起contentprovider操作的进程主线程。

而所有anr的原因,可以分为以下几类:

  1. 受anr监听的事件本身超时,如touch事件处理超时。这类anr比较容易定位。
  2. 受anr监听的事件无法进入主线程开始处理,如主线程被block或任务忙,等待时间超时引起anr。
  3. 窗口状态不对,如绘制超时,焦点错误等引起。

ps:主线程阻塞了不一定会引起ANR,有些情况下,主线程可能阻塞了一些时间,但在这些时间内,没有事件进入,因此就不触发anr。

不同ANR事件的监听实现

  1. input anr,通过InputDispatcher线程监听(直接用looper的epoll timeout,对于当前inputChannel,是阻塞式的,对其他channel则是非阻塞式的)
  2. service anr,通过AMS的主线程(线程名:ActivityManager)监听(通过Handler的delay,非阻塞式)
  3. broadcast anr,通过AMS的主线程(线程名:ActivityManager)监听(通过Handler的delay,非阻塞式)
  4. ContentProvider anr,通过当前进程的主线程去监听,触发后调用AMS去触发anr(通过Handler的delay,非阻塞式)

下图为典型的input anr触发流程

sequenceDiagram
InputDispatcher->>APPMainThread: send event to app
Note left of APPMainThread:正常情况下,app finsh event 的时间很短
APPMainThread-->>InputDispatcher: app finish event handle

InputDispatcher->>APPMainThread: send event to app
Note left of APPMainThread:anr时,app主线程在超时时间内
Note left of APPMainThread:不向InputDispatcher返回状态
InputDispatcher->>APPMainThread: send event to app
Note right of InputDispatcher:在下一个event事件时,触发一个5s的计时,

InputDispatcher ->> ActivityManager : triger anr
Note right of InputDispatcher:5s后如果还未收到finish,就触发anr

下图为典型的 service/broadcast anr触发流程

sequenceDiagram
Note right of ActivityManager:正常情况时:
Note right of ActivityManager:在跨进程之前,发一个超时的handler消息
ActivityManager->> APPMainThread: service/broadcast call
Note left of APPMainThread:ams调用相关的操作
APPMainThread-->>ActivityManager: app service/broadcast finish handle
Note right of ActivityManager:在跨进程之后,移出超时的handler消息

Note right of ActivityManager:ANR时:


ActivityManager->> APPMainThread: service/broadcast call
Note left of APPMainThread:应用一直无响应,超时消息到时后发出
Note left of APPMainThread:ams就执行对应的anr事件。

下图为典型的 contentProvider anr触发流程

sequenceDiagram
Note right of APP1MainThread: app1在跨进程调用contentProvider前,发一个超时消息
APP1MainThread->>APP2MainThread: call contentProvider method
APP2MainThread-->>APP1MainThread : return
Note right of APP1MainThread: 在超时前收到 contentProvider 的调用,移出超时的handler消息
Note right of APP1MainThread: 在超时前未收到 contentProvider 的调用,则向ams触发anr
APP1MainThread->>ActivityManager: triger anr

三:一些比较典型的复杂anr

1: 主线程空闲

"main" prio=5 tid=1 Native
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x752a4b08 self=0x7204214c00
  | sysTid=4434 nice=-10 cgrp=default sched=0/0 handle=0x7289e7b548
  | state=S schedstat=( 857618499 263812800 679 ) utm=70 stm=15 core=1 HZ=100
  | stack=0x7fccb8f000-0x7fccb91000 stackSize=8MB
  | held mutexes=
  kernel: (couldn't read /proc/self/task/4434/stack)
  native: #00 pc 000000000006e118  /system/lib64/libc.so (offset 6e000) (__epoll_pwait+8)
  native: #01 pc 0000000000013f8c  /system/lib64/libutils.so (android::Looper::pollInner(int)+144)
  native: #02 pc 0000000000013e60  /system/lib64/libutils.so (android::Looper::pollOnce(int, int*, int*, void**)+60)
  native: #03 pc 000000000011f9c0  /system/lib64/libandroid_runtime.so (android::android_os_MessageQueue_nativePollOnce(_JNIEnv*, _jobject*, long, int)+44)
  at android.os.MessageQueue.nativePollOnce(Native method)
  at android.os.MessageQueue.next(MessageQueue.java:326)
  at android.os.Looper.loop(Looper.java:160)
  at android.app.ActivityThread.main(ActivityThread.java:6733)
  at java.lang.reflect.Method.invoke(Native method)
  at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:506)
  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:858)

如上trace,出现anr的进程的主线程显空闲,很多同学在遇到类anr的时候会比较无助,大多数情况下,应用开发的同学会直接转给做系统开发的同学继续跟进。 遇到这类问题时,先确认anr触发时间和trace dump 时间差多久,以为附录示例为参考

  • 10-14 15:43:19.393在inputdispatcher中触发anr,
  • 10-14 15:43:19.421 在ams中开始处理anr
  • 10-14 15:43:19.934 开始dump anr进程的调用stack
  • 10-14 15:43:20.239 完成 anr进程的调用栈dump.

在这个示例中,从inputdispatcher发现事件超时,到开始dump anr进程的trace,花了500+ms,到dump完成,总共花了850+ms,这次不算长,也不算短,但实际上在这次中,主线程dump出来的stack是处于空闲态的。当然,这次anr并不是因为主线已经过了block时间点,但实际开发debug中,有很多情况下是dump的时候,anr的进程已经过了阻塞点。此次只能基于logcag的信息,找到anr触发的具体时间,再去看主线程的log.不然就容易在错误的时间,基于错误的log的分析。

2:跨线程/进程阻塞

被其他线程block的是很容易定位的,trace中通过lock锁就可以轻易找到,这里就不展开了。

跨进程阻塞:(这里只包含跨进程的部分,更早的调用栈包含公司包名信息,就不贴出来了)

"main" prio=5 tid=1 Native
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x73ec6b08 self=0x7e38414c00
  | sysTid=2411 nice=-10 cgrp=default sched=0/0 handle=0x7ebdf9b548
  | state=S schedstat=( 1110108606 4113781665 1463 ) utm=72 stm=38 core=0 HZ=100
  | stack=0x7ffbe55000-0x7ffbe57000 stackSize=8MB
  | held mutexes=
  kernel: (couldn't read /proc/self/task/2411/stack)
  native: #00 pc 000000000006e218  /system/lib64/libc.so (__ioctl+4)
  native: #01 pc 0000000000029124  /system/lib64/libc.so (ioctl+136)
  native: #02 pc 000000000005a5e4  /system/lib64/libbinder.so (android::IPCThreadState::talkWithDriver(bool)+256)
  native: #03 pc 000000000005b374  /system/lib64/libbinder.so (android::IPCThreadState::waitForResponse(android::Parcel*, int*)+60)
  native: #04 pc 000000000005b1c0  /system/lib64/libbinder.so (android::IPCThreadState::transact(int, unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+176)
  native: #05 pc 0000000000051318  /system/lib64/libbinder.so (android::BpBinder::transact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+72)
  native: #06 pc 000000000006a47c  /system/lib64/libmedia.so (android::BpMediaPlayerService::getCodecList() const+248)
  native: #07 pc 00000000001633b4  /system/lib64/libstagefright.so (android::MediaCodecList::getInstance() [clone .cfi]+332)
  native: #08 pc 0000000000034128  /system/lib64/libmedia_jni.so (android_media_MediaCodecList_getCodecCount(_JNIEnv*, _jobject*)+36)
  at android.media.MediaCodecList.native_getCodecCount(Native method)
  at android.media.MediaCodecList.initCodecList(MediaCodecList.java:82)

这调用栈中,可以看到主线程处在binder调用中,具体是哪一个调用,一般看BpBinder::transact前一个调用的方法是什么,这里是getCodecList,那就在整个anr文件中搜哪个进程在被调用这个方法。

"Binder:658_1" sysTid=951
  #00 pc 00053fac  /system/lib/libc.so (__ioctl+8)
  #01 pc 00021c59  /system/lib/libc.so (ioctl+36)
  #02 pc 00015def  /system/lib/libhwbinder.so (android::hardware::IPCThreadState::talkWithDriver(bool)+190)
  #03 pc 0001696d  /system/lib/libhwbinder.so (android::hardware::IPCThreadState::waitForResponse(android::hardware::Parcel*, int*)+268)
  #04 pc 00015bb7  /system/lib/libhwbinder.so (android::hardware::BpHwBinder::transact(unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int, std::__1::function<void (android::hardware::Parcel&)>)+90)
  #05 pc 00028cfb  /system/lib/android.hardware.media.omx@1.0.so (android::hardware::media::omx::V1_0::BpHwOmx::_hidl_allocateNode(android::hardware::IInterface*, android::hardware::details::HidlInstrumentor*, android::hardware::hidl_string const&, android::sp<android::hardware::media::omx::V1_0::IOmxObserver> const&, std::__1::function<void (android::hardware::media::omx::V1_0::Status, android::sp<android::hardware::media::omx::V1_0::IOmxNode> const&)>)+406)
  #06 pc 000296ef  /system/lib/android.hardware.media.omx@1.0.so (android::hardware::media::omx::V1_0::BpHwOmx::allocateNode(android::hardware::hidl_string const&, android::sp<android::hardware::media::omx::V1_0::IOmxObserver> const&, std::__1::function<void (android::hardware::media::omx::V1_0::Status, android::sp<android::hardware::media::omx::V1_0::IOmxNode> const&)>)+78)
  #07 pc 0001dceb  /system/lib/libmedia_omx.so (android::hardware::media::omx::V1_0::utils::LWOmx::allocateNode(char const*, android::sp<android::IOMXObserver> const&, android::sp<android::IOMXNode>*)+98)
  #08 pc 0005d51d  /system/lib/libstagefright.so (android::ACodec::queryCapabilities(char const*, char const*, char const*, bool, android::MediaCodecInfo::CapabilitiesWriter*)+200)
  #09 pc 000a0115  /system/lib/libstagefright.so (android::(anonymous namespace)::queryCapabilities(android::hardware::media::omx::V1_0::IOmxStore::NodeInfo const&, char const*, bool, android::MediaCodecInfo::CapabilitiesWriter*)+76)
  #10 pc 0009fed9  /system/lib/libstagefright.so (android::OmxInfoBuilder::buildMediaCodecList(android::MediaCodecListWriter*)+996)
  #11 pc 0008ffc3  /system/lib/libstagefright.so (android::MediaCodecList::MediaCodecList(std::__1::vector<android::MediaCodecListBuilderBase*, std::__1::allocator<android::MediaCodecListBuilderBase*>>)+234)
  #12 pc 0009015d  /system/lib/libstagefright.so (android::MediaCodecList::getLocalInstance()+68)
  #13 pc 00039279  /system/lib/libmediaplayerservice.so (android::MediaPlayerService::getCodecList() const+36)
  #14 pc 00033ce9  /system/lib/libmedia.so (android::BnMediaPlayerService::onTransact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+640)
  #15 pc 000361cf  /system/lib/libbinder.so (android::BBinder::transact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+70)
  #16 pc 0003da17  /system/lib/libbinder.so (android::IPCThreadState::executeCommand(int)+410)
  #17 pc 0003d7a3  /system/lib/libbinder.so (android::IPCThreadState::getAndExecuteCommand()+106)
  #18 pc 0003dccb  /system/lib/libbinder.so (android::IPCThreadState::joinThreadPool(bool)+38)
  #19 pc 0005470d  /system/lib/libbinder.so (android::PoolThread::threadLoop()+12)
  #20 pc 0000c1a3  /system/lib/libutils.so (android::Thread::_threadLoop(void*)+166)
  #21 pc 00063cb5  /system/lib/libc.so (__pthread_start(void*)+22)
  #22 pc 0001e0ad  /system/lib/libc.so (__start_thread+22)

搜索发现,在/system/bin/mediaserver进程中,有多个 getCodecList 被调用,这是一个互斥的调用,在此进程中,只有上面显示的这个调用栈在向下调用。 在这个调用中,可以看这个调用最后也是一个binder调用,也处在被其他进程block的状态。 那下一步就是继续找BpHwBinder::transact 的前一个调用的是什么方法,从trace中看是_hidl_allocateNode 接下来继续去追 _hidl_allocateNode的调用。直到找到真正的block点。

附录:一个完整的input anr 流程分析

下面通过一个完整的input anr触发流程,展示一个anr问题的分析过程(简单的anr并不需要完整的分析流程)这个流程分成5步,很多人在anr的trace文件中没有看到直接的问题点时,就直接从第5步所在位置开始分析,并从这个时间点向前推算,但这可能并不准确。就如示例中的一样,ActivityManager : ANR in所在时间点事实上与anr触发时有5s的差距。

正确的步骤应该是从Event log am_anr开始,如果没有event log,则向前看,直接找InputDispatcher,如果是input事件超时,会有一个wait queue age,用这个时间向前推,则可以找到精确的input时间。

找到精确的input时间后,再通过anr进程的主线程log,去看这个时间点上,主线程做了什么,如果前后都没有主线程的log,则有可能是阻塞了,再找anr前最近的主线程log去看,大概率就是那行log之后的操作引起的。

  1. 触发anr的事件

10-14 15:43:14.386 3693 3706 D (input事件,具体log包含公司信息,就不贴出来了)

  1. InputDispatcher 检测到事件超时,触发anr事件

默认超时时间为5s,超时方式为epoll_wait超时

10-14 15:43:19.393 1946 2036 I InputDispatcher : Application is not responding: AppWindowToken{d0046c6 token=Token{7a154a1 ActivityRecord{e26d508 u0 com.XXX.XXX/.activity.MainActivity t49}}}. It has been 5006.9ms since event, 5006.2ms since wait started. Reason: Waiting because no window has focus but there is a focused application that may eventually add a window when it finishes starting up.

10-14 15:43:19.394 1946 2036 I WindowManager : Input event dispatching timed out sending to application AppWindowToken{d0046c6 token=Token{7a154a1 ActivityRecord{e26d508 u0 com.XXX.XXX/.activity.MainActivity t49}}}. Reason: Waiting because no window has focus but there is a focused application that may eventually add a window when it finishes starting up.
  1. AMS收到anr事件,开始处理anr事件
10-14 15:43:19.421 1946 1961 I am_anr : [0,4827,com.XXX.XXX,550026829,Input dispatching timed out (Waiting because no window has focus but there is a focused application that may eventually add a window when it finishes starting up.)]
  1. 开始依次dump当前进程的调用栈
10-14 15:43:19.934 1946 1961 I system_server : libdebuggerd_client: started dumping process 4827
10-14 15:43:20.239 1946 1961 I system_server : libdebuggerd_client: done dumping process 4827
10-14 15:43:20.239 1946 1961 I system_server : libdebuggerd_client: started dumping process 1946
10-14 15:43:20.240 1946 1951 I system_server : Thread[2,tid=1951,WaitingInMainSignalCatcherLoop,Thread*=0x6f3f229000,peer=0x12d00000,"Signal Catcher"]: reacting to signal 3
10-14 15:43:20.240 1946 1951 I system_server :
10-14 15:43:20.421 1946 1951 I system_server : Wrote stack traces to '[tombstoned]'
10-14 15:43:20.421 1946 1961 I system_server : libdebuggerd_client: done dumping process 1946
...
10-14 15:43:26.570 1946 1961 I system_server : libdebuggerd_client: started dumping process 5840
10-14 15:43:26.716 1946 1961 I system_server : libdebuggerd_client: done dumping process 5840

  1. 完成调用栈,把cpu的负载状态打印出来
10-14 15:43:26.757 1946 1961 E ActivityManager : ANR in com.XXX.XXX (com.XXX.XXX/.activity.MainActivity)
10-14 15:43:26.757 1946 1961 E ActivityManager : PID: 4827
10-14 15:43:26.757 1946 1961 E ActivityManager : Reason: Input dispatching timed out (Waiting because no window has focus but there is a focused application that may eventually add a window when it finishes starting up.)
10-14 15:43:26.757 1946 1961 E ActivityManager : Load: 10.53 / 4.93 / 1.85
10-14 15:43:26.757 1946 1961 E ActivityManager : CPU usage from 0ms to 7320ms later (2022-10-14 15:43:19.396 to 2022-10-14 15:43:26.716):
10-14 15:43:26.757 1946 1961 E ActivityManager : 28% 2227/com.XXX.XXX: 16% user + 12% kernel / faults: 7991 minor 8 major
10-14 15:43:26.757 1946 1961 E ActivityManager : 19% 530/surfaceflinger: 9.5% user + 9.6% kernel / faults: 993 minor
10-14 15:43:26.757 1946 1961 E ActivityManager : 14% 6036/adbd: 1.7% user + 12% kernel / faults: 154514 minor
10-14 15:43:26.757 1946 1961 E ActivityManager : 11% 533/vendor.XXX.hardware.XXX@1.0-service: 9.6% user + 2% kernel / faults: 2508 minor
10-14 15:43:26.757 1946 1961 E ActivityManager : 10% 1946/system_server: 3.9% user + 6.6% kernel / faults: 4631 minor

至此,和anr相关的log信信息已经取完。

实际分析问题时,可以结合上面的5个时间节点去找对应的信息。