Listening for the Android ANR Signal and Dumping All Thread Stacks


I covered how ANR works in an earlier article; if you are interested, see: [Framework] 深入理解 Android ANR (Understanding Android ANR in Depth).

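After AMS sends the ANR signal to the app process, the process's Signal Catcher thread catches it and dumps the stacks of all threads to the /data/anr directory. Reading that directory requires root. On an emulator this is easy, since adb root grants root directly. On a typical phone it is much harder, but the files can be exported with the adb bugreport command.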

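So there are ways to get the ANR dump file offline, but they are cumbersome. Android also provides no dedicated API for an ANR callback, and there is no way to retrieve the dump file from users in production. This article therefore shows how to listen for the ANR signal and capture the dump that is produced when an ANR occurs.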

Listening for the ANR signal

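On Android the ANR signal is SIGQUIT. By default it is blocked in the thread's signal mask, so a handler we register for it would never be called. We need to unblock it first: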

sigset_t sig_sets;
sigemptyset(&sig_sets);
sigaddset(&sig_sets, SIGQUIT);
pthread_sigmask(SIG_UNBLOCK, &sig_sets, nullptr);
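It can be useful to confirm that the mask really changed. The following is a minimal sketch rather than the author's code: `isSignalBlocked` and `unblockSignal` are hypothetical helper names, and the sketch relies only on the POSIX calls `pthread_sigmask` and `sigismember`.

```cpp
#include <cassert>
#include <pthread.h>
#include <signal.h>

// Hypothetical helper: true if `signo` is blocked in the calling thread's mask.
bool isSignalBlocked(int signo) {
    sigset_t current;
    sigemptyset(&current);
    // Passing nullptr as the new set only queries the current mask.
    if (pthread_sigmask(SIG_BLOCK, nullptr, &current) != 0) {
        return false;
    }
    return sigismember(&current, signo) == 1;
}

// Hypothetical helper: unblocks `signo` for the calling thread only.
bool unblockSignal(int signo) {
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, signo);
    return pthread_sigmask(SIG_UNBLOCK, &set, nullptr) == 0;
}
```

The signal mask is per thread, so the unblock affects only the thread that makes the call. A handler for SIGQUIT can then run on that thread.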

Once it is unblocked, we can replace the signal handler:

struct sigaction sigAction{};
sigfillset(&sigAction.sa_mask);
sigAction.sa_flags = SA_RESTART | SA_ONSTACK | SA_SIGINFO;
sigAction.sa_sigaction = anrSignalHandler;
// sigaction() returns 0 on success and -1 on failure, with the reason in errno (<cerrno>).
int ret = sigaction(SIGQUIT, &sigAction, nullptr);
if (ret == 0) {
    LOGD("Monitor anr signal success.");
} else {
    LOGE("Monitor anr signal fail: %d", errno);
}

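anrSignalHandler in the code above is a pointer to our signal handler function, and sigaction() registers it. The third parameter of sigaction() receives the previous action: pass a pointer to a struct sigaction and the old action is written to that address. With the old handler saved, we could forward the signal to it after receiving it.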
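To make the role of the third parameter concrete, here is a small sketch of saving and restoring the previous action. It is an illustration under assumptions, not the article's implementation: `countingHandler`, `installHandler`, `restoreHandler` and `hadCustomHandler` are hypothetical names, and the signal number is a parameter so the sketch can be tried with a harmless signal such as SIGUSR1.

```cpp
#include <cassert>
#include <signal.h>

static struct sigaction gOldAction;            // previous action, filled in by sigaction()
static volatile sig_atomic_t gSignalCount = 0; // incremented by our handler

static void countingHandler(int, siginfo_t *, void *) {
    gSignalCount = gSignalCount + 1;
}

// Installs countingHandler for `signo` and remembers what was registered before.
bool installHandler(int signo) {
    struct sigaction action{};
    sigfillset(&action.sa_mask);
    action.sa_flags = SA_RESTART | SA_SIGINFO;
    action.sa_sigaction = countingHandler;
    return sigaction(signo, &action, &gOldAction) == 0;
}

// Puts the remembered action back.
bool restoreHandler(int signo) {
    return sigaction(signo, &gOldAction, nullptr) == 0;
}

// True if the saved action is a callable handler rather than SIG_DFL or SIG_IGN.
bool hadCustomHandler() {
    if (gOldAction.sa_flags & SA_SIGINFO) {
        return gOldAction.sa_sigaction != nullptr;
    }
    return gOldAction.sa_handler != SIG_DFL && gOldAction.sa_handler != SIG_IGN;
}
```

A check like `hadCustomHandler()` matters before chaining, because SIG_DFL and SIG_IGN are sentinel values and calling through them crashes. This may be relevant for SIGQUIT in particular: ART's Signal Catcher waits for the signal with sigwait() instead of registering a handler, so the saved action may well be SIG_DFL.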

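I do not save the old handler here, though. I tried, but calling back into the old handler after receiving the signal produced an error, and I have not yet found the cause. So I pass the signal on to the original handling path in a different way, described below.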

Now let's look at my signal handler:

static void anrSignalHandler(int sig, siginfo_t *sig_info, void *uc) {
    LOGD("Receive anr signal.");
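    // The sender's pid sits in a different padding slot depending on the ABI
    // (index 3 on 32-bit, index 4 on 64-bit), so both are checked;
    // sig_info->si_pid is the portable accessor.
    int fromPid1 = sig_info->_si_pad[3];
    int fromPid2 = sig_info->_si_pad[4];
    int myPid = getpid();
    if (fromPid1 != myPid && fromPid2 != myPid) {
        // Sent by another process (AMS): run our own ANR handling.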
        pthread_mutex_lock(lock);
        if (dumpState == NO_DUMP) {
            dumpState = WAITING_ANR_DUMP;
        } else {
            LOGE("Skip dump anr, because state: %d", dumpState);
        }
        pthread_mutex_unlock(lock);
    }
    syscall(SYS_tgkill, myPid, gSignalCatcherTid, SIGQUIT);
}

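As mentioned earlier, the ANR signal is sent to the app process by AMS, so for a real ANR the sender is never our own process. Our process can also send itself this signal, simply by calling kill, so we run the ANR handling only when the sender is some other process. After receiving the signal we still need to pass it on to the Signal Catcher thread, which we do with syscall(SYS_tgkill, myPid, gSignalCatcherTid, SIGQUIT);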
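The sender check can be tried outside Android with the portable `si_pid` field. The sketch below is only an illustration: `sentByOtherProcess` and `sendAndReceiveFromSelf` are hypothetical helpers, and the signal is received synchronously with `sigwaitinfo` (a Linux/POSIX call) instead of through a handler so that the `siginfo_t` can be inspected directly.

```cpp
#include <cassert>
#include <signal.h>
#include <unistd.h>

// Hypothetical helper: true when the signal came from a different process.
// si_pid is filled in for signals sent with kill(), tgkill() or sigqueue().
bool sentByOtherProcess(const siginfo_t *info) {
    return info->si_pid != getpid();
}

// Sends `signo` to our own process and receives it synchronously,
// storing the accompanying siginfo_t in `out`. Returns false on failure.
bool sendAndReceiveFromSelf(int signo, siginfo_t *out) {
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, signo);
    // Block first so the signal stays pending instead of running a handler.
    if (sigprocmask(SIG_BLOCK, &set, nullptr) != 0) {
        return false;
    }
    if (kill(getpid(), signo) != 0) {
        return false;
    }
    return sigwaitinfo(&set, out) == signo;
}
```

A signal the process sends to itself carries its own pid, so `sentByOtherProcess` returns false for it, which is the case the handler wants to skip.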

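That raises another question: how do we get the tid of the Signal Catcher thread? On Linux, /proc/[pid] holds a lot of information about a process, and /proc/[pid]/task contains one subdirectory per thread of that process. Each subdirectory is named after the thread's tid, and the comm file inside it contains the thread's name.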

OPD2A0:/proc/26483/task $ ls
16343  16346  16348  16350  16354  16357  16374  16377  16379  16381  16392  16394  16396  16398  16400  16402  16405  16412  16577  22976  22978
16344  16347  16349  16351  16355  16365  16376  16378  16380  16390  16393  16395  16397  16399  16401  16404  16407  16576  16814  22977  26483

So by reading these files we can find the tid for a given thread name, and the name for a given tid.

Here is my reference code:

int getSignalCatcherTid() {
    char taskPath[MAX_BUFFER_SIZE];
    int size = snprintf(taskPath, sizeof(taskPath), "/proc/%d/task", getpid());
    if (size < 0 || size >= MAX_BUFFER_SIZE) {
        LOGE("Build proc path fail, size: %d", size);
        return -1;
    }
    DIR *taskDir = opendir(taskPath);
    if (taskDir == nullptr) {
        LOGE("Read process dir fail.");
        return -1;
    }
    int tid = -1;
    dirent *child;
    // Advance in the loop condition so that `continue` can never spin forever.
    while ((child = readdir(taskDir)) != nullptr) {
        // Skip ".", ".." and anything else that is not a numeric tid.
        if (!isNumberStr(child->d_name, 256)) {
            continue;
        }
        char commPath[MAX_BUFFER_SIZE];
        size = snprintf(commPath, sizeof(commPath), "%s/%s/comm", taskPath, child->d_name);
        if (size < 0 || size >= MAX_BUFFER_SIZE) {
            continue;
        }
        int fd = open(commPath, O_RDONLY);
        if (fd < 0) {
            continue; // the thread may already have exited
        }
        char threadName[MAX_BUFFER_SIZE];
        ssize_t len = read(fd, threadName, sizeof(threadName) - 1);
        close(fd);
        if (len <= 0) {
            continue;
        }
        // comm ends with '\n'; replace it with the string terminator.
        threadName[len - 1] = '\0';
        if (strcmp(threadName, "Signal Catcher") == 0) {
            tid = atoi(child->d_name);
            break;
        }
    }
    closedir(taskDir);
    return tid;
}

Capturing the Signal Catcher thread's dump

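We are now listening for the ANR signal, but how do we get hold of the dump that the Signal Catcher thread writes? The key point is that Signal Catcher is a thread inside our own app process, created when the process starts. To capture what it writes, we can use a PLT/GOT hook on its write() calls, which gives us the written content. I have covered PLT/GOT hooking before; if you are interested, see: 手把手教你如何 Hook Native 方法 (A Hands-on Guide to Hooking Native Methods).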
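The idea behind such a hook can be shown without any hooking library. The sketch below is a simplified model and not a real PLT/GOT hook: an ordinary function pointer (`gWriteSlot`, a hypothetical name) stands in for the GOT entry, and swapping it for a wrapper that records the data and then forwards to the saved original mirrors what rewriting the entry achieves. All names here are assumptions made for illustration.

```cpp
#include <cassert>
#include <fcntl.h>
#include <string>
#include <unistd.h>

using WriteFn = ssize_t (*)(int, const void *, size_t);

// Stand-in for a GOT entry: callers go through this pointer instead of
// calling write() directly.
static WriteFn gWriteSlot = ::write;
static WriteFn gOriginWrite = nullptr; // saved original target of the slot
static std::string gCaptured;          // everything seen by the wrapper

// Replacement: records the bytes, then forwards to the original.
static ssize_t capturingWrite(int fd, const void *buf, size_t count) {
    gCaptured.append(static_cast<const char *>(buf), count);
    return gOriginWrite(fd, buf, count);
}

// Swaps the slot once, keeping the original so it can still be called.
void installWriteHook() {
    if (gOriginWrite == nullptr) {
        gOriginWrite = gWriteSlot;
        gWriteSlot = capturingWrite;
    }
}

// What a call site effectively does: an indirect call through the slot.
ssize_t writeThroughSlot(int fd, const void *buf, size_t count) {
    return gWriteSlot(fd, buf, count);
}
```

The caller is unchanged and still gets the real result of write(); it just passes through the wrapper first. A real hook additionally has to locate the entry in the loaded library and make the page writable, which is what a library such as xHook takes care of.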

I use xHook to do the hooking:


int hookSignalCatcherWrite() {
    int apiLevel = android_get_device_api_level();
    int signalCatcherTid = gSignalCatcherTid;
    if (signalCatcherTid <= 0) {
        signalCatcherTid = getSignalCatcherTid();
        gSignalCatcherTid = signalCatcherTid;
    }
    LOGD("ApiLevel: %d, SignalCatcherTid: %d", apiLevel, signalCatcherTid);
    if (signalCatcherTid <= 0) {
        LOGE("Get Signal Catcher tid fail.");
        return -1;
    }
    // xHook matches library paths with a regular expression, so the dot is escaped.
    const char *writeLibName;
    if (apiLevel >= 30 || apiLevel == 25 || apiLevel == 24) {
        writeLibName = ".*/libc\\.so$";
    } else if (apiLevel == 29) {
        writeLibName = ".*/libbase\\.so$";
    } else {
        writeLibName = ".*/libart\\.so$";
    }
    int ret = xhook_register(writeLibName,
                             "write",
                             (void *) my_write,
                             nullptr);
    LOGD("xhook hook write register result: %d", ret);
    if (ret == 0) {
        ret = xhook_refresh(1);
        LOGD("xhook hook write refresh result: %d", ret);
        return ret;
    } else {
        return ret;
    }
}

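The so library that has to be hooked differs between Android versions. I followed what others have done here; the most reliable approach is to check the Android source to see which so the Signal Catcher code is built into.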

Now a quick look at the implementation of our hook function my_write:

ssize_t my_write(int fd, const void *const buf, size_t count) {
    // Only intercept writes made by the Signal Catcher thread.
    if (gSignalCatcherTid == gettid()) {
        pthread_mutex_lock(lock);
        if (dumpState != NO_DUMP) {
            LOGD("SignalCatcher write count: %zu", count);
            long time = get_time_millis();
            char stackFileName[MAX_BUFFER_SIZE];
            const char *dir;
            if (dumpState == WAITING_STACK_DUMP) {
                dir = gStackTraceDir;
                LOGD("Start stack dump.");
            } else {
                dir = gAnrTraceDir;
                LOGD("Start anr dump.");
            }
            snprintf(stackFileName, sizeof(stackFileName), "%s/%ld.text", dir, time);
            LOGD("Create stack file: %s", stackFileName);
            int fileFd = open(stackFileName, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR);
            if (fileFd >= 0) {
                // Copy the dump into our own file.
                write(fileFd, buf, count);
                close(fileFd);
                // Publish the timestamp that identifies the new file.
                write(gStackNotifyFd, &time, sizeof(time));
            } else {
                LOGE("Create file fail: %d", errno);
            }
        }
        pthread_mutex_unlock(lock);
    }
    // Always forward to the real write().
    return origin_write(fd, buf, count);
}

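We first check that the current thread is Signal Catcher and then check the state we maintain ourselves. If both match, we treat the data as the ANR dump we want and write it to our own file. Finally, we always call the real write() implementation.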

Actively dumping all thread stacks

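Getting the stack dump through the system's ANR signal is a passive approach. Sometimes we want to know the state of every thread in the app right now. In that case we can send SIGQUIT to the Signal Catcher thread ourselves and pick up the resulting dump through the same hook. The signal is sent the same way as in our custom signal handler, with syscall(SYS_tgkill, myPid, gSignalCatcherTid, SIGQUIT);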
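An on-demand request needs two steps: mark the state so that the write hook knows a dump is expected, then signal the thread. The following is a minimal sketch under assumptions, not the library's actual code: the state names follow the ones used in the article, while `requestStackDump`, `gStateLock` and the `std::mutex` are hypothetical, and the target tid and signal are parameters so the sketch can be exercised without ART.

```cpp
#include <cassert>
#include <mutex>
#include <signal.h>
#include <sys/syscall.h>
#include <unistd.h>

enum DumpState { NO_DUMP, WAITING_ANR_DUMP, WAITING_STACK_DUMP };

static std::mutex gStateLock;
static DumpState gDumpState = NO_DUMP;

// Marks a dump as requested and signals the given thread of this process.
// Returns false if another dump is already pending or the signal cannot be sent.
bool requestStackDump(pid_t catcherTid, int signo) {
    {
        std::lock_guard<std::mutex> guard(gStateLock);
        if (gDumpState != NO_DUMP) {
            return false; // a dump is already in flight
        }
        gDumpState = WAITING_STACK_DUMP;
    }
    // tgkill targets one specific thread (tid) inside a thread group (pid).
    if (syscall(SYS_tgkill, getpid(), catcherTid, signo) != 0) {
        std::lock_guard<std::mutex> guard(gStateLock);
        gDumpState = NO_DUMP; // roll back so a later request can succeed
        return false;
    }
    return true;
}
```

On a device, `catcherTid` would be the Signal Catcher tid found earlier and `signo` would be SIGQUIT. Something still has to set the state back to NO_DUMP once the dump has been captured; the sketch leaves that step out.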

Sample ANR dump file

// ...
suspend all histogram:	Sum: 165us 99% C.I. 1us-21us Avg: 7.173us Max: 21us
DALVIK THREADS (23):
"Signal Catcher" daemon prio=10 tid=2 Runnable
  | group="system" sCount=0 ucsCount=0 flags=0 obj=0x13600338 self=0xb400007bf3a26000
  | sysTid=5041 nice=-20 cgrp=default sched=0/0 handle=0x7bf4ffbcb0
  | state=R schedstat=( 28127001 5785385 10 ) utm=2 stm=0 core=5 HZ=100
  | stack=0x7bf4f04000-0x7bf4f06000 stackSize=991KB
  | held mutexes= "mutator lock"(shared held)
  native: #00 pc 0000000000570ec4  /apex/com.android.art/lib64/libart.so (art::DumpNativeStack(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, int, BacktraceMap*, char const*, art::ArtMethod*, void*, bool)+148) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
  native: #01 pc 0000000000675a24  /apex/com.android.art/lib64/libart.so (art::Thread::DumpStack(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, bool, BacktraceMap*, bool) const+340) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
  native: #02 pc 000000000069310c  /apex/com.android.art/lib64/libart.so (art::DumpCheckpoint::Run(art::Thread*)+908) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
  native: #03 pc 000000000068ccac  /apex/com.android.art/lib64/libart.so (art::ThreadList::RunCheckpoint(art::Closure*, art::Closure*)+508) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
  native: #04 pc 000000000068bf54  /apex/com.android.art/lib64/libart.so (art::ThreadList::Dump(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, bool)+1796) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
  native: #05 pc 000000000068b70c  /apex/com.android.art/lib64/libart.so (art::ThreadList::DumpForSigQuit(std::__1::basic_ostream<char, std::__1::char_traits<char> >&)+1340) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
  native: #06 pc 000000000063d300  /apex/com.android.art/lib64/libart.so (art::Runtime::DumpForSigQuit(std::__1::basic_ostream<char, std::__1::char_traits<char> >&)+208) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
  native: #07 pc 0000000000651dc0  /apex/com.android.art/lib64/libart.so (art::SignalCatcher::HandleSigQuit()+1376) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
  native: #08 pc 0000000000650e54  /apex/com.android.art/lib64/libart.so (art::SignalCatcher::Run(void*)+340) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
  native: #09 pc 00000000000eb720  /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+208) (BuildId: cd953571180b7f5f8ae5570dad29595f)
  native: #10 pc 000000000007e2d0  /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: cd953571180b7f5f8ae5570dad29595f)
  (no managed stack frames)

"main" prio=5 tid=1 Native
  | group="main" sCount=1 ucsCount=0 flags=1 obj=0x73869160 self=0xb400007c11e10800
  | sysTid=15609 nice=-10 cgrp=default sched=1073741824/0 handle=0x7cbd635500
  | state=S schedstat=( 1086854706 330699698 4068 ) utm=63 stm=45 core=6 HZ=100
  | stack=0x7fd3027000-0x7fd3029000 stackSize=8188KB
  | held mutexes=
  native: #00 pc 0000000000078dec  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28) (BuildId: cd953571180b7f5f8ae5570dad29595f)
  native: #01 pc 00000000002833dc  /apex/com.android.art/lib64/libart.so (art::ConditionVariable::WaitHoldingLocks(art::Thread*)+140) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
  native: #02 pc 000000000043bf3c  /apex/com.android.art/lib64/libart.so (art::(anonymous namespace)::CheckJNI::FindClass(_JNIEnv*, char const*) (.llvm.11132044689082360456)+460) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
  native: #03 pc 0000000000128ebc  /system/lib64/libandroid_runtime.so (android::NativeDisplayEventReceiver::dispatchVsync(long, android::PhysicalDisplayId, unsigned int, android::gui::VsyncEventData)+92) (BuildId: 4da95a3e8bdc1b6a6682b67c10bdc47e)
  native: #04 pc 00000000000c1820  /system/lib64/libgui.so (android::DisplayEventDispatcher::handleEvent(int, int, void*)+272) (BuildId: 1d69b7a57862392ad7b7712ed6197e18)
  native: #05 pc 000000000001836c  /system/lib64/libutils.so (android::Looper::pollInner(int)+1068) (BuildId: 6038dbf95f76d91eaf842148f10f89ea)
  native: #06 pc 0000000000017ee0  /system/lib64/libutils.so (android::Looper::pollOnce(int, int*, int*, void**)+112) (BuildId: 6038dbf95f76d91eaf842148f10f89ea)
  native: #07 pc 000000000016410c  /system/lib64/libandroid_runtime.so (android::android_os_MessageQueue_nativePollOnce(_JNIEnv*, _jobject*, long, int)+44) (BuildId: 4da95a3e8bdc1b6a6682b67c10bdc47e)
  at android.os.MessageQueue.nativePollOnce(Native method)
  at android.os.MessageQueue.next(MessageQueue.java:339)
  at android.os.Looper.loopOnce(Looper.java:186)
  at android.os.Looper.loop(Looper.java:351)
  at android.app.ActivityThread.main(ActivityThread.java:8377)
  at java.lang.reflect.Method.invoke(Native method)
  at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:584)
  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1013)

"Jit thread pool worker thread 0" daemon prio=5 tid=4 Native
  | group="system" sCount=1 ucsCount=0 flags=1 obj=0x135c0720 self=0xb400007bf3a47800
  | sysTid=5046 nice=9 cgrp=default sched=0/0 handle=0x7bf4d01cb0
  | state=S schedstat=( 12650002 4618461 48 ) utm=0 stm=0 core=1 HZ=100
  | stack=0x7bf4c02000-0x7bf4c04000 stackSize=1023KB
  | held mutexes=
  native: #00 pc 0000000000078dec  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28) (BuildId: cd953571180b7f5f8ae5570dad29595f)
  native: #01 pc 00000000002833dc  /apex/com.android.art/lib64/libart.so (art::ConditionVariable::WaitHoldingLocks(art::Thread*)+140) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
  native: #02 pc 0000000000694b78  /apex/com.android.art/lib64/libart.so (art::ThreadPool::GetTask(art::Thread*)+120) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
  native: #03 pc 0000000000693f50  /apex/com.android.art/lib64/libart.so (art::ThreadPoolWorker::Run()+144) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
  native: #04 pc 00000000006939cc  /apex/com.android.art/lib64/libart.so (art::ThreadPoolWorker::Callback(void*)+172) (BuildId: f9461dad2df8cf4e9114de5c4ff5caf5)
  native: #05 pc 00000000000eb720  /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+208) (BuildId: cd953571180b7f5f8ae5570dad29595f)
  native: #06 pc 000000000007e2d0  /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: cd953571180b7f5f8ae5570dad29595f)
  (no managed stack frames)

"perfetto_hprof_listener" prio=10 tid=8 Native (still starting up)
  | group="" sCount=1 ucsCount=0 flags=1 obj=0x0 self=0xb400007bf3a6f800
  | sysTid=5044 nice=-20 cgrp=default sched=0/0 handle=0x7bf4efdcb0
  | state=S schedstat=( 119385 21461461 4 ) utm=0 stm=0 core=6 HZ=100
  | stack=0x7bf4e06000-0x7bf4e08000 stackSize=991KB
  | held mutexes=
  native: #00 pc 00000000000d5774  /apex/com.android.runtime/lib64/bionic/libc.so (read+4) (BuildId: cd953571180b7f5f8ae5570dad29595f)
  native: #01 pc 000000000001dee4  /apex/com.android.art/lib64/libperfetto_hprof.so (void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, ArtPlugin_Initialize::$_34> >(void*)+260) (BuildId: 13ee3b989b35c4e1d3ac372e558e2961)
  native: #02 pc 00000000000eb720  /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+208) (BuildId: cd953571180b7f5f8ae5570dad29595f)
  native: #03 pc 000000000007e2d0  /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: cd953571180b7f5f8ae5570dad29595f)
  (no managed stack frames)

"binder:15609_1" prio=5 tid=9 Native
  | group="main" sCount=1 ucsCount=0 flags=1 obj=0x13640020 self=0xb400007bf4867400
  | sysTid=5054 nice=-20 cgrp=default sched=0/0 handle=0x7bf42dfcb0
  | state=S schedstat=( 333385 370462 3 ) utm=0 stm=0 core=4 HZ=100
  | stack=0x7bf41e8000-0x7bf41ea000 stackSize=991KB
  | held mutexes=
  native: #00 pc 00000000000d5a54  /apex/com.android.runtime/lib64/bionic/libc.so (__ioctl+4) (BuildId: cd953571180b7f5f8ae5570dad29595f)
  native: #01 pc 00000000000873bc  /apex/com.android.runtime/lib64/bionic/libc.so (ioctl+156) (BuildId: cd953571180b7f5f8ae5570dad29595f)
  native: #02 pc 000000000005f48c  /system/lib64/libbinder.so (android::IPCThreadState::talkWithDriver(bool)+284) (BuildId: 821d5191ea842f908c210c9c338b12f6)
  native: #03 pc 000000000005f788  /system/lib64/libbinder.so (android::IPCThreadState::getAndExecuteCommand()+24) (BuildId: 821d5191ea842f908c210c9c338b12f6)
  native: #04 pc 00000000000600a4  /system/lib64/libbinder.so (android::IPCThreadState::joinThreadPool(bool)+68) (BuildId: 821d5191ea842f908c210c9c338b12f6)
  native: #05 pc 0000000000090048  /system/lib64/libbinder.so (android::PoolThread::threadLoop()+24) (BuildId: 821d5191ea842f908c210c9c338b12f6)
  native: #06 pc 0000000000013550  /system/lib64/libutils.so (android::Thread::_threadLoop(void*)+416) (BuildId: 6038dbf95f76d91eaf842148f10f89ea)
  native: #07 pc 00000000000cc59c  /system/lib64/libandroid_runtime.so (android::AndroidRuntime::javaThreadShell(void*)+140) (BuildId: 4da95a3e8bdc1b6a6682b67c10bdc47e)
  native: #08 pc 00000000000eb720  /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+208) (BuildId: cd953571180b7f5f8ae5570dad29595f)
  native: #09 pc 000000000007e2d0  /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: cd953571180b7f5f8ae5570dad29595f)
  (no managed stack frames)
// ... 

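The file contains the Java and native stacks of all threads, along with each thread's state, held locks, stack size and other useful details, all of which help a great deal when analyzing a problem.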
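Because every thread entry starts with a header line in a regular format, the dump is easy to post-process. The sketch below parses one header line such as `"main" prio=5 tid=1 Native`. It is an illustration based only on the sample above: `ThreadHeader` and `parseThreadHeader` are hypothetical names, and other Android versions may format the line differently.

```cpp
#include <cassert>
#include <regex>
#include <string>

struct ThreadHeader {
    std::string name;
    bool daemon = false;
    int priority = 0;
    int tid = 0;       // runtime-internal thread id, not the kernel sysTid
    std::string state; // e.g. "Runnable", "Native"
};

// Parses a line like: "Signal Catcher" daemon prio=10 tid=2 Runnable
// Returns false if the line is not a thread header.
bool parseThreadHeader(const std::string &line, ThreadHeader *out) {
    static const std::regex pattern(
        R"re(^"([^"]+)"( daemon)? prio=(\d+) tid=(\d+) (\w+))re");
    std::smatch m;
    if (!std::regex_search(line, m, pattern)) {
        return false;
    }
    out->name = m[1].str();
    out->daemon = m[2].matched;
    out->priority = std::stoi(m[3].str());
    out->tid = std::stoi(m[4].str());
    out->state = m[5].str();
    return true;
}
```

Feeding each line of a captured file through a parser like this gives a quick overview of the threads and their states, for example to find every thread that is not in the Native or Waiting state.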

Finally

I have open-sourced all of the code above and also published it as a standalone aar library; if you are interested, take a look: dumpstack