Third-Party Library Source Notes (1): Understanding xCrash and Leveling Up Your Android Crash-Capture Skills

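With a good third-party library, we should know not only what it does but also why it works. Plenty of technical blogs already analyze well-known open-source libraries such as EventBus, ARouter, LeakCanary, Retrofit, Glide, and OkHttp, and that coverage is quite thorough. So I plan to write a series of source-code walkthroughs for libraries that are a little less famous but just as useful. I hope it helps.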

0x01. Introduction to xCrash

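xCrash is an Android crash-capture framework open-sourced by iQIYI. It lets an Android app capture Java crashes, native crashes, and ANRs, without root or any system permission. When the app process crashes or hits an ANR, xCrash writes a tombstone file (in a format similar to the Android system's tombstone files) to a directory you specify. It has been used for years in many of iQIYI's Android apps (including iQIYI Video) across phones, tablets, and TVs.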

0x02. Java Crash

Java crashes are captured by implementing a custom UncaughtExceptionHandler and installing it with Thread.setDefaultUncaughtExceptionHandler(this).

class JavaCrashHandler implements UncaughtExceptionHandler {

    /** omitted **/

    void initialize(int pid, String processName, String appId, String appVersion, String logDir, boolean rethrow,
                    int logcatSystemLines, int logcatEventsLines, int logcatMainLines,
                    boolean dumpFds, boolean dumpNetworkInfo, boolean dumpAllThreads, int dumpAllThreadsCountMax, String[] dumpAllThreadsWhiteList,
                    ICrashCallback callback) {
        this.pid = pid;
        this.processName = (TextUtils.isEmpty(processName) ? "unknown" : processName);
        this.appId = appId;
        this.appVersion = appVersion;
        this.rethrow = rethrow;
        this.logDir = logDir;
        this.logcatSystemLines = logcatSystemLines;
        this.logcatEventsLines = logcatEventsLines;
        this.logcatMainLines = logcatMainLines;
        this.dumpFds = dumpFds;
        this.dumpNetworkInfo = dumpNetworkInfo;
        this.dumpAllThreads = dumpAllThreads;
        this.dumpAllThreadsCountMax = dumpAllThreadsCountMax;
        this.dumpAllThreadsWhiteList = dumpAllThreadsWhiteList;
        this.callback = callback;
        this.defaultHandler = Thread.getDefaultUncaughtExceptionHandler();

        try {
            Thread.setDefaultUncaughtExceptionHandler(this);
        } catch (Exception e) {
            XCrash.getLogger().e(Util.TAG, "JavaCrashHandler setDefaultUncaughtExceptionHandler failed", e);
        }
    }

    /** omitted **/

}

Note this line:

this.defaultHandler = Thread.getDefaultUncaughtExceptionHandler();

The default UncaughtExceptionHandler is saved here so that, once our custom handler has generated the tombstone, the original handler can be put back and the exception handed over to the system.

    @Override
    public void uncaughtException(Thread thread, Throwable throwable) {
        if (defaultHandler != null) {
            Thread.setDefaultUncaughtExceptionHandler(defaultHandler);
        }

        try {
            handleException(thread, throwable);
        } catch (Exception e) {
            XCrash.getLogger().e(Util.TAG, "JavaCrashHandler handleException failed", e);
        }

        if (this.rethrow) {
            if (defaultHandler != null) {
                defaultHandler.uncaughtException(thread, throwable);
            }
        } else {
            ActivityMonitor.getInstance().finishAllActivities();
            Process.killProcess(this.pid);
            System.exit(10);
        }
    }

0x03. ANR Crash

xCrash captures ANRs in two different ways, depending on the API level.

  • API level < 21

    class AnrHandler {
        /** omitted **/
        void initialize(Context ctx, int pid, String processName, String appId, String appVersion, String logDir,
                        boolean checkProcessState, int logcatSystemLines, int logcatEventsLines, int logcatMainLines,
                        boolean dumpFds, boolean dumpNetworkInfo, ICrashCallback callback) {
    
            //check API level
            if (Build.VERSION.SDK_INT >= 21) {
                return;
            }
    
            this.ctx = ctx;
            this.pid = pid;
            this.processName = (TextUtils.isEmpty(processName) ? "unknown" : processName);
            this.appId = appId;
            this.appVersion = appVersion;
            this.logDir = logDir;
            this.checkProcessState = checkProcessState;
            this.logcatSystemLines = logcatSystemLines;
            this.logcatEventsLines = logcatEventsLines;
            this.logcatMainLines = logcatMainLines;
            this.dumpFds = dumpFds;
            this.dumpNetworkInfo = dumpNetworkInfo;
            this.callback = callback;
    
            fileObserver = new FileObserver("/data/anr/", CLOSE_WRITE) {
                public void onEvent(int event, String path) {
                    try {
                        if (path != null) {
                            String filepath = "/data/anr/" + path;
                            if (filepath.contains("trace")) {
                                handleAnr(filepath);
                            }
                        }
                    } catch (Exception e) {
                        XCrash.getLogger().e(Util.TAG, "AnrHandler fileObserver onEvent failed", e);
                    }
                }
            };
    
            try {
                fileObserver.startWatching();
            } catch (Exception e) {
                fileObserver = null;
                XCrash.getLogger().e(Util.TAG, "AnrHandler fileObserver startWatching failed", e);
            }
        }
        /** omitted **/
    }
    

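    Before Android 5.0, you could register a FileObserver on /data/anr/ and treat a newly written trace file there as the sign that an ANR had occurred. This approach no longer works on later versions.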

  • API level >= 21

    On later versions, xCrash detects an ANR in the native layer by listening for the SIGQUIT signal.

    int xc_trace_init(JNIEnv *env,
                      int rethrow,
                      unsigned int logcat_system_lines,
                      unsigned int logcat_events_lines,
                      unsigned int logcat_main_lines,
                      int dump_fds,
                      int dump_network_info)
    {
        int r;
        pthread_t thd;
    
        //capture SIGQUIT only for ART
        if(xc_common_api_level < 21) return 0;
    
        //is Android Lollipop (5.x)?
        xc_trace_is_lollipop = ((21 == xc_common_api_level || 22 == xc_common_api_level) ? 1 : 0);
    
        xc_trace_dump_status = XC_TRACE_DUMP_NOT_START;
        xc_trace_rethrow = rethrow;
        xc_trace_logcat_system_lines = logcat_system_lines;
        xc_trace_logcat_events_lines = logcat_events_lines;
        xc_trace_logcat_main_lines = logcat_main_lines;
        xc_trace_dump_fds = dump_fds;
        xc_trace_dump_network_info = dump_network_info;
    
        //init for JNI callback
        xc_trace_init_callback(env);
    
        //create event FD
        if(0 > (xc_trace_notifier = eventfd(0, EFD_CLOEXEC))) return XCC_ERRNO_SYS;
    
        //register signal handler
        if(0 != (r = xcc_signal_trace_register(xc_trace_handler))) goto err2;
    
        //create thread for dump trace
        if(0 != (r = pthread_create(&thd, NULL, xc_trace_dumper, NULL))) goto err1;
    
        return 0;
    
     err1:
        xcc_signal_trace_unregister();
     err2:
        close(xc_trace_notifier);
        xc_trace_notifier = -1;
        
        return r;
    }
    

    Note this line:

    if(0 > (xc_trace_notifier = eventfd(0, EFD_CLOEXEC))) return XCC_ERRNO_SYS;
    

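    The event FD is created up front, before the signal handler is registered. It is the channel through which the signal handler wakes the dumper thread: the dumper blocks in read() on xc_trace_notifier, as the xc_trace_dumper code further down shows. Creating it during initialization also fits a broader concern in xCrash. FD leaks are a common indirect cause of process crashes, which means that inside a signal handler you cannot count on FD-dependent operations, such as open() + read() on files under /proc. To avoid disturbing the normal operation of the app, xCrash reserves an FD ahead of time so that the crash record file can still be created reliably when the crash happens.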

    int xcc_signal_trace_register(void (*handler)(int, siginfo_t *, void *))
    {
        int              r;
        sigset_t         set;
        struct sigaction act;
    
        //un-block the SIGQUIT mask for current thread, hope this is the main thread
        sigemptyset(&set);
        sigaddset(&set, SIGQUIT);
        if(0 != (r = pthread_sigmask(SIG_UNBLOCK, &set, &xcc_signal_trace_oldset))) return r;
    
        //register new signal handler for SIGQUIT
        memset(&act, 0, sizeof(act));
        sigfillset(&act.sa_mask);
        act.sa_sigaction = handler;
        act.sa_flags = SA_RESTART | SA_SIGINFO;
        if(0 != sigaction(SIGQUIT, &act, &xcc_signal_trace_oldact))
        {
            pthread_sigmask(SIG_SETMASK, &xcc_signal_trace_oldset, NULL);
            return XCC_ERRNO_SYS;
        }
    
        return 0;
    }
    

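    Here, pthread_sigmask first sets SIGQUIT to UNBLOCK for the current thread, and then sigaction registers the signal handler for SIGQUIT. sigaction() takes three arguments: the signal to handle, the new action, and a place to store the old action. When the handler is invoked, it writes to xc_trace_notifier, which wakes the dumper thread.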

    static void *xc_trace_dumper(void *arg)
    {
        JNIEnv         *env = NULL;
        uint64_t        data;
        uint64_t        trace_time;
        int             fd;
        struct timeval  tv;
        char            pathname[1024];
        jstring         j_pathname;
        
        (void)arg;
        
        pthread_detach(pthread_self());
    
        JavaVMAttachArgs attach_args = {
            .version = XC_JNI_VERSION,
            .name    = "xcrash_trace_dp",
            .group   = NULL
        };
        if(JNI_OK != (*xc_common_vm)->AttachCurrentThread(xc_common_vm, &env, &attach_args)) goto exit;
    
        while(1)
        {
            //block here, waiting for sigquit
            XCC_UTIL_TEMP_FAILURE_RETRY(read(xc_trace_notifier, &data, sizeof(data)));
            
            //check if process already crashed
            if(xc_common_native_crashed || xc_common_java_crashed) break;
    
            //trace time
            if(0 != gettimeofday(&tv, NULL)) break;
            trace_time = (uint64_t)(tv.tv_sec) * 1000 * 1000 + (uint64_t)tv.tv_usec;
    
            //Keep only one current trace.
            if(0 != xc_trace_logs_clean()) continue;
    
            //create and open log file
            if((fd = xc_common_open_trace_log(pathname, sizeof(pathname), trace_time)) < 0) continue;
    
            //write header info
            if(0 != xc_trace_write_header(fd, trace_time)) goto end;
    
            //write trace info from ART runtime
            if(0 != xcc_util_write_format(fd, XCC_UTIL_THREAD_SEP"Cmd line: %s\n", xc_common_process_name)) goto end;
            if(0 != xcc_util_write_str(fd, "Mode: ART DumpForSigQuit\n")) goto end;
            if(0 != xc_trace_load_symbols())
            {
                if(0 != xcc_util_write_str(fd, "Failed to load symbols.\n")) goto end;
                goto skip;
            }
            if(0 != xc_trace_check_address_valid())
            {
                if(0 != xcc_util_write_str(fd, "Failed to check runtime address.\n")) goto end;
                goto skip;
            }
            if(dup2(fd, STDERR_FILENO) < 0)
            {
                if(0 != xcc_util_write_str(fd, "Failed to duplicate FD.\n")) goto end;
                goto skip;
            }
    
            xc_trace_dump_status = XC_TRACE_DUMP_ON_GOING;
            if(sigsetjmp(jmpenv, 1) == 0) 
            {
                if(xc_trace_is_lollipop)
                    xc_trace_libart_dbg_suspend();
                xc_trace_libart_runtime_dump(*xc_trace_libart_runtime_instance, xc_trace_libcpp_cerr);
                if(xc_trace_is_lollipop)
                    xc_trace_libart_dbg_resume();
            } 
            else 
            {
                fflush(NULL);
                XCD_LOG_WARN("longjmp to skip dumping trace\n");
            }
    
            dup2(xc_common_fd_null, STDERR_FILENO);
                                
        skip:
            if(0 != xcc_util_write_str(fd, "\n"XCC_UTIL_THREAD_END"\n")) goto end;
    
            //write other info
            if(0 != xcc_util_record_logcat(fd, xc_common_process_id, xc_common_api_level, xc_trace_logcat_system_lines, xc_trace_logcat_events_lines, xc_trace_logcat_main_lines)) goto end;
            if(xc_trace_dump_fds)
                if(0 != xcc_util_record_fds(fd, xc_common_process_id)) goto end;
            if(xc_trace_dump_network_info)
                if(0 != xcc_util_record_network_info(fd, xc_common_process_id, xc_common_api_level)) goto end;
            if(0 != xcc_meminfo_record(fd, xc_common_process_id)) goto end;
    
        end:
            //close log file
            xc_common_close_trace_log(fd);
    
            //rethrow SIGQUIT to ART Signal Catcher
            if(xc_trace_rethrow && (XC_TRACE_DUMP_ART_CRASH != xc_trace_dump_status)) xc_trace_send_sigquit();
            xc_trace_dump_status = XC_TRACE_DUMP_END;
    
            //JNI callback
            //Do we need to implement an emergency buffer for disk exhausted?
            if(NULL == xc_trace_cb_method) continue;
            if(NULL == (j_pathname = (*env)->NewStringUTF(env, pathname))) continue;
            (*env)->CallStaticVoidMethod(env, xc_common_cb_class, xc_trace_cb_method, j_pathname, NULL);
            XC_JNI_IGNORE_PENDING_EXCEPTION();
            (*env)->DeleteLocalRef(env, j_pathname);
        }
        
        (*xc_common_vm)->DetachCurrentThread(xc_common_vm);
    
     exit:
        close(xc_trace_notifier);
        xc_trace_notifier = -1;
        return NULL;
    }
    

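    The newly created thread runs an endless loop that blocks in read() on xc_trace_notifier. When the read returns, SIGQUIT has arrived, which means an ANR has occurred. The thread then dumps the thread traces and calls the Java callback method saved earlier by xc_trace_init_callback(), notifying the Java layer so it can finish the monitoring work.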
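
To make the handoff from handler to thread concrete, here is a minimal standalone sketch of the same pattern. It is not xCrash's code: every demo_* name is my own, and raise(SIGQUIT) stands in for the SIGQUIT that the system sends on an ANR. The handler only does an async-signal-safe write() to an eventfd, and a worker thread blocked in read() wakes up to do the heavy work.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

static int demo_notifier = -1; /* eventfd, created before the handler is installed */
static int demo_woken = 0;     /* set by the worker once read() returns */

/* Handler: write() is async-signal-safe, so just bump the eventfd counter. */
static void demo_trace_handler(int sig, siginfo_t *si, void *uc)
{
    int saved_errno = errno;
    uint64_t one = 1;
    ssize_t n;
    (void)sig; (void)si; (void)uc;
    if (demo_notifier >= 0) {
        n = write(demo_notifier, &one, sizeof(one));
        (void)n;
    }
    errno = saved_errno;
}

/* Worker: blocks in read() until the handler has written to the eventfd. */
static void *demo_trace_dumper(void *arg)
{
    uint64_t data = 0;
    ssize_t n;
    (void)arg;
    do {
        n = read(demo_notifier, &data, sizeof(data));
    } while (n < 0 && errno == EINTR);
    if (n == (ssize_t)sizeof(data)) demo_woken = 1; /* a real dumper would write the trace here */
    return NULL;
}

/* Returns 1 if the raised SIGQUIT woke the worker, 0 otherwise. */
int demo_run(void)
{
    struct sigaction act, oldact;
    sigset_t set, oldset;
    pthread_t thd;
    int ok = 0;

    demo_woken = 0;
    if ((demo_notifier = eventfd(0, EFD_CLOEXEC)) < 0) return 0;

    /* unblock SIGQUIT for this thread, then install the handler */
    sigemptyset(&set);
    sigaddset(&set, SIGQUIT);
    if (0 != pthread_sigmask(SIG_UNBLOCK, &set, &oldset)) goto out_fd;

    memset(&act, 0, sizeof(act));
    sigfillset(&act.sa_mask);
    act.sa_sigaction = demo_trace_handler;
    act.sa_flags = SA_RESTART | SA_SIGINFO;
    if (0 != sigaction(SIGQUIT, &act, &oldact)) goto out_mask;

    if (0 == pthread_create(&thd, NULL, demo_trace_dumper, NULL)) {
        raise(SIGQUIT); /* stand-in for the SIGQUIT sent on ANR */
        pthread_join(thd, NULL);
        ok = demo_woken;
    }

    sigaction(SIGQUIT, &oldact, NULL);
out_mask:
    pthread_sigmask(SIG_SETMASK, &oldset, NULL);
out_fd:
    close(demo_notifier);
    demo_notifier = -1;
    return ok;
}
```

Calling demo_run() from main() on Linux (link with -pthread) exercises the whole path. The eventfd counter also covers the case where the signal arrives before the worker reaches read(): the value is already non-zero, so read() returns immediately.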

0x04. Native Crash

Capturing native crashes follows a flow similar to ANR capture on the newer API levels.

int xc_crash_init(JNIEnv *env,
                  int rethrow,
                  unsigned int logcat_system_lines,
                  unsigned int logcat_events_lines,
                  unsigned int logcat_main_lines,
                  int dump_elf_hash,
                  int dump_map,
                  int dump_fds,
                  int dump_network_info,
                  int dump_all_threads,
                  unsigned int dump_all_threads_count_max,
                  const char **dump_all_threads_whitelist,
                  size_t dump_all_threads_whitelist_len)
{
    xc_crash_prepared_fd = XCC_UTIL_TEMP_FAILURE_RETRY(open("/dev/null", O_RDWR));
    xc_crash_rethrow = rethrow;
    if(NULL == (xc_crash_emergency = calloc(XC_CRASH_EMERGENCY_BUF_LEN, 1))) return XCC_ERRNO_NOMEM;
    if(NULL == (xc_crash_dumper_pathname = xc_util_strdupcat(xc_common_app_lib_dir, "/"XCC_UTIL_XCRASH_DUMPER_FILENAME))) return XCC_ERRNO_NOMEM;

    //init the local unwinder for fallback mode
    xcc_unwind_init(xc_common_api_level);

    //init for JNI callback
    xc_crash_init_callback(env);

    //struct info passed to the dumper process
    memset(&xc_crash_spot, 0, sizeof(xcc_spot_t));
    xc_crash_spot.api_level = xc_common_api_level;
    xc_crash_spot.crash_pid = xc_common_process_id;
    xc_crash_spot.start_time = xc_common_start_time;
    xc_crash_spot.time_zone = xc_common_time_zone;
    xc_crash_spot.logcat_system_lines = logcat_system_lines;
    xc_crash_spot.logcat_events_lines = logcat_events_lines;
    xc_crash_spot.logcat_main_lines = logcat_main_lines;
    xc_crash_spot.dump_elf_hash = dump_elf_hash;
    xc_crash_spot.dump_map = dump_map;
    xc_crash_spot.dump_fds = dump_fds;
    xc_crash_spot.dump_network_info = dump_network_info;
    xc_crash_spot.dump_all_threads = dump_all_threads;
    xc_crash_spot.dump_all_threads_count_max = dump_all_threads_count_max;
    xc_crash_spot.os_version_len = strlen(xc_common_os_version);
    xc_crash_spot.kernel_version_len = strlen(xc_common_kernel_version);
    xc_crash_spot.abi_list_len = strlen(xc_common_abi_list);
    xc_crash_spot.manufacturer_len = strlen(xc_common_manufacturer);
    xc_crash_spot.brand_len = strlen(xc_common_brand);
    xc_crash_spot.model_len = strlen(xc_common_model);
    xc_crash_spot.build_fingerprint_len = strlen(xc_common_build_fingerprint);
    xc_crash_spot.app_id_len = strlen(xc_common_app_id);
    xc_crash_spot.app_version_len = strlen(xc_common_app_version);
    xc_crash_init_dump_all_threads_whitelist(dump_all_threads_whitelist, dump_all_threads_whitelist_len);

    //for clone and fork
#ifndef __i386__
    if(NULL == (xc_crash_child_stack = calloc(XC_CRASH_CHILD_STACK_LEN, 1))) return XCC_ERRNO_NOMEM;
    xc_crash_child_stack = (void *)(((uint8_t *)xc_crash_child_stack) + XC_CRASH_CHILD_STACK_LEN);
#else
    if(0 != pipe2(xc_crash_child_notifier, O_CLOEXEC)) return XCC_ERRNO_SYS;
#endif
    
    //register signal handler
    return xcc_signal_crash_register(xc_crash_signal_handler);
}

It first initializes the unwinder that is used to obtain the backtrace, then saves the Java callback method, and finally registers the signal handlers.

int xcc_signal_crash_register(void (*handler)(int, siginfo_t *, void *))
{
    stack_t ss;
    if(NULL == (ss.ss_sp = calloc(1, XCC_SIGNAL_CRASH_STACK_SIZE))) return XCC_ERRNO_NOMEM;
    ss.ss_size  = XCC_SIGNAL_CRASH_STACK_SIZE;
    ss.ss_flags = 0;
    if(0 != sigaltstack(&ss, NULL)) return XCC_ERRNO_SYS;

    struct sigaction act;
    memset(&act, 0, sizeof(act));
    sigfillset(&act.sa_mask);
    act.sa_sigaction = handler;
    act.sa_flags = SA_RESTART | SA_SIGINFO | SA_ONSTACK;
    
    size_t i;
    for(i = 0; i < sizeof(xcc_signal_crash_info) / sizeof(xcc_signal_crash_info[0]); i++)
        if(0 != sigaction(xcc_signal_crash_info[i].signum, &act, &(xcc_signal_crash_info[i].oldact)))
            return XCC_ERRNO_SYS;

    return 0;
}
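
The sigaltstack() + SA_ONSTACK pair in the registration above matters because a crash can be caused by a stack overflow, in which case the thread's normal stack has no room left for the handler. Here is a small sketch, under my own assumptions, that shows the effect in isolation: it installs an alternate stack, raises a harmless SIGUSR1 instead of a real crash signal, and checks that a local variable of the handler really lives inside the alternate stack. The demo_* names and the 64 KB size are illustrative, not taken from xCrash.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_ALT_STACK_SIZE (64 * 1024)

static uint8_t *demo_alt_stack = NULL;
static volatile sig_atomic_t demo_on_alt_stack = 0;

/* Handler: check whether one of its locals lives inside the alternate stack. */
static void demo_stack_handler(int sig, siginfo_t *si, void *uc)
{
    volatile char marker = 0;
    uintptr_t p = (uintptr_t)&marker;
    uintptr_t lo = (uintptr_t)demo_alt_stack;
    (void)sig; (void)si; (void)uc;
    demo_on_alt_stack = (p >= lo && p < lo + DEMO_ALT_STACK_SIZE) ? 1 : 0;
}

/* Returns 1 if the handler ran on the alternate stack, 0 if not, -1 on setup error. */
int demo_handler_uses_alt_stack(void)
{
    stack_t ss, old_ss;
    struct sigaction act, oldact;
    int result = -1;

    if (NULL == (demo_alt_stack = calloc(1, DEMO_ALT_STACK_SIZE))) return -1;

    /* tell the kernel about the alternate stack for this thread */
    ss.ss_sp = demo_alt_stack;
    ss.ss_size = DEMO_ALT_STACK_SIZE;
    ss.ss_flags = 0;
    if (0 != sigaltstack(&ss, &old_ss)) goto out_free;

    /* SA_ONSTACK asks for the handler to be run on that stack */
    memset(&act, 0, sizeof(act));
    sigfillset(&act.sa_mask);
    act.sa_sigaction = demo_stack_handler;
    act.sa_flags = SA_SIGINFO | SA_ONSTACK;
    if (0 != sigaction(SIGUSR1, &act, &oldact)) goto out_stack;

    demo_on_alt_stack = 0;
    raise(SIGUSR1); /* harmless stand-in for SIGSEGV and friends */
    result = demo_on_alt_stack;

    sigaction(SIGUSR1, &oldact, NULL);
out_stack:
    sigaltstack(&old_ss, NULL);
out_free:
    free(demo_alt_stack);
    demo_alt_stack = NULL;
    return result;
}
```

Dropping SA_ONSTACK from sa_flags makes the handler run on the normal thread stack again, which is an easy way to see what the flag contributes.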

The xcc_signal_crash_info array used during registration is the list of signals to handle. I have annotated what each signal means in the code below.

static xcc_signal_crash_info_t xcc_signal_crash_info[] =
{
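    {.signum = SIGABRT}, // the process kills itself via abort() / kill() / tkill() / tgkill(), or is killed by another process via kill() / tkill() / tgkill()
    {.signum = SIGBUS},  // access to an invalid physical device address
    {.signum = SIGFPE},  // arithmetic error, e.g. integer division by zero
    {.signum = SIGILL},  // unrecognized CPU instruction
    {.signum = SIGSEGV}, // access to an invalid virtual memory address
    {.signum = SIGTRAP}, // breakpoint or trap instruction
    {.signum = SIGSYS},  // unrecognized system call
    {.signum = SIGSTKFLT}// stack fault on coprocessor
};
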
static void xc_crash_signal_handler(int sig, siginfo_t *si, void *uc)
{
    // omitted

    pid_t dumper_pid = xc_crash_fork(xc_crash_exec_dumper);

    // omitted
}

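Note what happens when the signal is handled: a new process is started to do the work. This sidesteps the async-signal-safe restrictions, virtual address space exhaustion, and FD exhaustion. The new process can use ptrace() to suspend every thread of the crashed process, collect registers and backtraces not just for the crashing thread but for all other threads in the process too, and read memory more safely.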
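
As a rough illustration of handing the work to a child process from inside a signal handler, here is a minimal sketch under my own assumptions. It uses plain fork() for brevity (the xc_crash_init code above prepares a child stack for clone on most ABIs), it handles SIGUSR1 instead of a real crash signal so the demo process survives, and the "dumper" child only reports the signal number through its exit code instead of exec'ing a helper and using ptrace(). All demo_* names are hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t demo_dumper_result = -1;

/* Handler: spawn a child to do the real work, then wait for it. */
static void demo_crash_handler(int sig, siginfo_t *si, void *uc)
{
    int saved_errno = errno;
    int status = 0;
    pid_t pid, r;
    (void)si; (void)uc;

    pid = fork();
    if (0 == pid) {
        /* child: a real dumper would exec a helper binary here;
           this one only reports the signal number via its exit code */
        _exit(sig & 0x7f);
    }
    if (pid > 0) {
        /* parent (the "crashed" process): wait until the dumper is done */
        do {
            r = waitpid(pid, &status, 0);
        } while (r < 0 && errno == EINTR);
        if (r == pid && WIFEXITED(status)) demo_dumper_result = WEXITSTATUS(status);
    }
    errno = saved_errno;
}

/* Returns the exit code reported by the child, or -1 on failure. */
int demo_fork_from_handler(void)
{
    struct sigaction act, oldact;

    memset(&act, 0, sizeof(act));
    sigfillset(&act.sa_mask);
    act.sa_sigaction = demo_crash_handler;
    act.sa_flags = SA_RESTART | SA_SIGINFO;
    if (0 != sigaction(SIGUSR1, &act, &oldact)) return -1;

    demo_dumper_result = -1;
    raise(SIGUSR1); /* harmless stand-in for a crash signal */

    sigaction(SIGUSR1, &oldact, NULL);
    return demo_dumper_result;
}
```

The handler itself sticks to calls that are traditionally safe in this context (fork, waitpid, _exit), and everything that is not safe there would move into the child.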

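0x05. Notes

A few points stood out while reading the source:

  • To guard against resources running out, the FDs are opened ahead of time during initialization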

    //create prepared FD for FD exhausted case
    xc_common_open_prepared_fd(1);
    xc_common_open_prepared_fd(0);
    
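  • There are two ways to listen for a signal

    1. sigwait: blocking. If several threads wait this way there is no telling which one receives the signal, and once one thread receives it the others do not.
    2. sigaction: installs a signal handler and does not block. Android blocks SIGQUIT by default, though, so we first have to unblock it with pthread_sigmask or sigprocmask and only then install the handler.

  • After catching the signal, we still have to send a regular SIGQUIT on to the Signal Catcher thread; otherwise the system never receives the signal and does not do its own ANR handling

    int tid = getSignalCatcherThreadId(); // walk /proc/[pid]/task to find the tid of the Signal Catcher thread
    tgkill(getpid(), tid, SIGQUIT);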
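
The last point can be sketched as follows. This is my own guess at the approach, not xCrash's implementation: demo_find_thread_by_name() scans /proc/self/task, compares each thread's comm file with the wanted name, and demo_send_to_thread() delivers a signal to that one thread through the tgkill system call. On Android the name to look for would be "Signal Catcher"; since that thread does not exist in an ordinary Linux process, the demo names the calling thread "demo_catcher" and signals itself with SIGUSR1. All demo_* names are hypothetical.

```c
#define _GNU_SOURCE
#include <dirent.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static volatile sig_atomic_t demo_got_signal = 0;

static void demo_mark_handler(int sig) { (void)sig; demo_got_signal = 1; }

/* Return the tid of the first thread of this process named `name`, or -1. */
pid_t demo_find_thread_by_name(const char *name)
{
    DIR *dir = opendir("/proc/self/task");
    struct dirent *ent;
    pid_t found = -1;

    if (NULL == dir) return -1;
    while (found < 0 && NULL != (ent = readdir(dir))) {
        char path[288], comm[32];
        FILE *f;
        if (ent->d_name[0] < '0' || ent->d_name[0] > '9') continue; /* skip "." and ".." */
        snprintf(path, sizeof(path), "/proc/self/task/%s/comm", ent->d_name);
        if (NULL == (f = fopen(path, "r"))) continue;
        if (NULL != fgets(comm, sizeof(comm), f)) {
            comm[strcspn(comm, "\n")] = '\0'; /* comm ends with a newline */
            if (0 == strcmp(comm, name)) found = (pid_t)atoi(ent->d_name);
        }
        fclose(f);
    }
    closedir(dir);
    return found;
}

/* Send `sig` to exactly one thread of this process. */
int demo_send_to_thread(pid_t tid, int sig)
{
    return (int)syscall(SYS_tgkill, getpid(), tid, sig);
}

/* Name the calling thread, find it by name, signal it. Returns 1 on success. */
int demo_round_trip(void)
{
    struct sigaction act, oldact;
    char old_name[16] = {0};
    pid_t self = (pid_t)syscall(SYS_gettid);
    pid_t tid;
    int ok = 0;

    memset(&act, 0, sizeof(act));
    sigemptyset(&act.sa_mask);
    act.sa_handler = demo_mark_handler;
    if (0 != sigaction(SIGUSR1, &act, &oldact)) return 0;

    prctl(PR_GET_NAME, old_name, 0, 0, 0);
    prctl(PR_SET_NAME, "demo_catcher", 0, 0, 0);

    demo_got_signal = 0;
    tid = demo_find_thread_by_name("demo_catcher");
    if (tid == self && 0 == demo_send_to_thread(tid, SIGUSR1)) ok = demo_got_signal;

    prctl(PR_SET_NAME, old_name, 0, 0, 0);
    sigaction(SIGUSR1, &oldact, NULL);
    return ok;
}
```

Targeting a single thread with tgkill rather than the whole process with kill() is the point here: a process-wide SIGQUIT could be picked up again by our own handler instead of reaching the thread that is meant to process it.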
    

0x06. Summary

Finally, here is the mind map I put together. Feel free to get in touch and discuss.

xCrash.png