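With a good third-party library, we should understand not only what it does but also why it works. Plenty of tech blogs have already analyzed well-known open-source libraries such as EventBus, ARouter, LeakCanary, Retrofit, Glide, and OkHttp, and that coverage is quite thorough. So I plan to write a series of source-code analyses of libraries that are less well known but just as useful. I hope it helps.
0x01. Introduction to xCrash
xCrash is an Android crash-capturing framework open-sourced by iQIYI. It gives Android apps the ability to capture Java crashes, native crashes, and ANRs, without root or any system permissions. When the app process crashes or hits an ANR, xCrash generates a tombstone file (in a format similar to the Android system's tombstone files) in a directory you specify. It has been used for years in many of iQIYI's Android apps (including iQIYI Video) across phones, tablets, and TVs.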
0x02. Java Crash
Java crashes are captured by implementing a custom UncaughtExceptionHandler and installing it with Thread.setDefaultUncaughtExceptionHandler(this).
class JavaCrashHandler implements UncaughtExceptionHandler {
    /* ... omitted ... */
    void initialize(int pid, String processName, String appId, String appVersion, String logDir, boolean rethrow,
                    int logcatSystemLines, int logcatEventsLines, int logcatMainLines,
                    boolean dumpFds, boolean dumpNetworkInfo, boolean dumpAllThreads, int dumpAllThreadsCountMax, String[] dumpAllThreadsWhiteList,
                    ICrashCallback callback) {
        this.pid = pid;
        this.processName = (TextUtils.isEmpty(processName) ? "unknown" : processName);
        this.appId = appId;
        this.appVersion = appVersion;
        this.rethrow = rethrow;
        this.logDir = logDir;
        this.logcatSystemLines = logcatSystemLines;
        this.logcatEventsLines = logcatEventsLines;
        this.logcatMainLines = logcatMainLines;
        this.dumpFds = dumpFds;
        this.dumpNetworkInfo = dumpNetworkInfo;
        this.dumpAllThreads = dumpAllThreads;
        this.dumpAllThreadsCountMax = dumpAllThreadsCountMax;
        this.dumpAllThreadsWhiteList = dumpAllThreadsWhiteList;
        this.callback = callback;
        this.defaultHandler = Thread.getDefaultUncaughtExceptionHandler();
        try {
            Thread.setDefaultUncaughtExceptionHandler(this);
        } catch (Exception e) {
            XCrash.getLogger().e(Util.TAG, "JavaCrashHandler setDefaultUncaughtExceptionHandler failed", e);
        }
    }
    /* ... omitted ... */
}
Note this line in particular:
this.defaultHandler = Thread.getDefaultUncaughtExceptionHandler();
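The default UncaughtExceptionHandler is saved so that it can be put back once our custom handler is invoked. As the code below shows, uncaughtException first restores the saved handler, then generates the tombstone in handleException, and finally, if rethrow is enabled, passes the exception on to the saved handler so that the system can deal with it; otherwise it finishes all activities and kills the process itself.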
@Override
public void uncaughtException(Thread thread, Throwable throwable) {
    if (defaultHandler != null) {
        Thread.setDefaultUncaughtExceptionHandler(defaultHandler);
    }
    try {
        handleException(thread, throwable);
    } catch (Exception e) {
        XCrash.getLogger().e(Util.TAG, "JavaCrashHandler handleException failed", e);
    }
    if (this.rethrow) {
        if (defaultHandler != null) {
            defaultHandler.uncaughtException(thread, throwable);
        }
    } else {
        ActivityMonitor.getInstance().finishAllActivities();
        Process.killProcess(this.pid);
        System.exit(10);
    }
}
0x03. ANR Crash
xCrash captures ANRs differently depending on the API level, so the two cases are covered separately.
- API level < 21
class AnrHandler {
    /* ... omitted ... */
    void initialize(Context ctx, int pid, String processName, String appId, String appVersion, String logDir,
                    boolean checkProcessState, int logcatSystemLines, int logcatEventsLines, int logcatMainLines,
                    boolean dumpFds, boolean dumpNetworkInfo, ICrashCallback callback) {
        //check API level
        if (Build.VERSION.SDK_INT >= 21) {
            return;
        }
        this.ctx = ctx;
        this.pid = pid;
        this.processName = (TextUtils.isEmpty(processName) ? "unknown" : processName);
        this.appId = appId;
        this.appVersion = appVersion;
        this.logDir = logDir;
        this.checkProcessState = checkProcessState;
        this.logcatSystemLines = logcatSystemLines;
        this.logcatEventsLines = logcatEventsLines;
        this.logcatMainLines = logcatMainLines;
        this.dumpFds = dumpFds;
        this.dumpNetworkInfo = dumpNetworkInfo;
        this.callback = callback;
        fileObserver = new FileObserver("/data/anr/", CLOSE_WRITE) {
            public void onEvent(int event, String path) {
                try {
                    if (path != null) {
                        String filepath = "/data/anr/" + path;
                        if (filepath.contains("trace")) {
                            handleAnr(filepath);
                        }
                    }
                } catch (Exception e) {
                    XCrash.getLogger().e(Util.TAG, "AnrHandler fileObserver onEvent failed", e);
                }
            }
        };
        try {
            fileObserver.startWatching();
        } catch (Exception e) {
            fileObserver = null;
            XCrash.getLogger().e(Util.TAG, "AnrHandler fileObserver startWatching failed", e);
        }
    }
    /* ... omitted ... */
}
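Before Android 5.0, you could register a FileObserver on /data/anr/ and treat a finished write to a trace file in that directory as the sign that an ANR had occurred. This approach no longer works on later versions.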
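Android's FileObserver is documented as being built on inotify, so the same idea can be sketched directly in C. This is not xCrash code: `watch_dir` and `wait_for_trace_file` are hypothetical names, and the directory is a parameter so the sketch can be tried on any writable directory rather than /data/anr/.

```c
#include <stdio.h>
#include <string.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Hypothetical helper: start watching `dir` for files closed after writing,
   the native counterpart of FileObserver(dir, CLOSE_WRITE). */
static int watch_dir(const char *dir)
{
    int fd = inotify_init1(IN_CLOEXEC);
    if (fd < 0) return -1;
    if (inotify_add_watch(fd, dir, IN_CLOSE_WRITE) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Block until a file whose name contains "trace" has been written and closed
   in the watched directory, then copy its name into `name`.
   Events that follow the match in the same read() are dropped in this sketch. */
static int wait_for_trace_file(int fd, char *name, size_t len)
{
    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));

    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n <= 0) return -1;
        for (char *p = buf; p < buf + n; ) {
            const struct inotify_event *ev = (const struct inotify_event *)p;
            if (ev->len > 0 && NULL != strstr(ev->name, "trace")) {
                snprintf(name, len, "%s", ev->name);
                return 0;
            }
            p += sizeof(struct inotify_event) + ev->len;
        }
    }
}
```

The "trace" substring filter mirrors the `filepath.contains("trace")` check in the Java code above.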
- API level >= 21
On later versions, xCrash detects ANRs in the native layer by listening for a signal (SIGQUIT).
int xc_trace_init(JNIEnv *env,
                  int rethrow,
                  unsigned int logcat_system_lines,
                  unsigned int logcat_events_lines,
                  unsigned int logcat_main_lines,
                  int dump_fds,
                  int dump_network_info)
{
    int r;
    pthread_t thd;
    //capture SIGQUIT only for ART
    if(xc_common_api_level < 21) return 0;
    //is Android Lollipop (5.x)?
    xc_trace_is_lollipop = ((21 == xc_common_api_level || 22 == xc_common_api_level) ? 1 : 0);
    xc_trace_dump_status = XC_TRACE_DUMP_NOT_START;
    xc_trace_rethrow = rethrow;
    xc_trace_logcat_system_lines = logcat_system_lines;
    xc_trace_logcat_events_lines = logcat_events_lines;
    xc_trace_logcat_main_lines = logcat_main_lines;
    xc_trace_dump_fds = dump_fds;
    xc_trace_dump_network_info = dump_network_info;
    //init for JNI callback
    xc_trace_init_callback(env);
    //create event FD
    if(0 > (xc_trace_notifier = eventfd(0, EFD_CLOEXEC))) return XCC_ERRNO_SYS;
    //register signal handler
    if(0 != (r = xcc_signal_trace_register(xc_trace_handler))) goto err2;
    //create thread for dump trace
    if(0 != (r = pthread_create(&thd, NULL, xc_trace_dumper, NULL))) goto err1;
    return 0;
 err1:
    xcc_signal_trace_unregister();
 err2:
    close(xc_trace_notifier);
    xc_trace_notifier = -1;
    return r;
}
Note this line:
if(0 > (xc_trace_notifier = eventfd(0, EFD_CLOEXEC))) return XCC_ERRNO_SYS;
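The event FD created here is the notification channel between the signal handler and the dump thread: the thread started a few lines later blocks in read() on it until the handler updates it. Creating it up front, during initialization, matters because FD leaks are a common indirect cause of process crashes. Once FDs are exhausted, nothing that needs a new FD works reliably in a signal handler, for example open() + read() to collect the various pieces of information under /proc. For the same reason, and so as not to interfere with the app's normal operation, xCrash also reserves an FD in advance so that the crash log file can be created reliably at crash time (see 0x05).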
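To make the eventfd semantics concrete, here is a minimal sketch (the helper names `notifier_create`, `notifier_post`, and `notifier_wait` are mine, not xCrash's). An eventfd is a kernel counter behind a file descriptor: every write() adds to the counter, and a read() blocks while the counter is zero, then returns its value and resets it.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Create the notification FD once, at init time; the counter starts at 0. */
static int notifier_create(void)
{
    return eventfd(0, EFD_CLOEXEC);
}

/* Post one event. write() is async-signal-safe, so a signal handler may do
   this (a real handler should also save and restore errno). */
static int notifier_post(int fd)
{
    uint64_t one = 1;
    ssize_t n;
    do {
        n = write(fd, &one, sizeof(one));
    } while (n < 0 && errno == EINTR);
    return n == (ssize_t)sizeof(one) ? 0 : -1;
}

/* Block until at least one event has been posted. Returns the number of
   events accumulated since the previous read, or 0 on error. */
static uint64_t notifier_wait(int fd)
{
    uint64_t count = 0;
    ssize_t n;
    do {
        n = read(fd, &count, sizeof(count));
    } while (n < 0 && errno == EINTR);
    return n == (ssize_t)sizeof(count) ? count : 0;
}
```

Because the FD already exists before anything goes wrong, posting an event never needs to allocate a new descriptor.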
int xcc_signal_trace_register(void (*handler)(int, siginfo_t *, void *))
{
    int r;
    sigset_t set;
    struct sigaction act;
    //un-block the SIGQUIT mask for current thread, hope this is the main thread
    sigemptyset(&set);
    sigaddset(&set, SIGQUIT);
    if(0 != (r = pthread_sigmask(SIG_UNBLOCK, &set, &xcc_signal_trace_oldset))) return r;
    //register new signal handler for SIGQUIT
    memset(&act, 0, sizeof(act));
    sigfillset(&act.sa_mask);
    act.sa_sigaction = handler;
    act.sa_flags = SA_RESTART | SA_SIGINFO;
    if(0 != sigaction(SIGQUIT, &act, &xcc_signal_trace_oldact))
    {
        pthread_sigmask(SIG_SETMASK, &xcc_signal_trace_oldset, NULL);
        return XCC_ERRNO_SYS;
    }
    return 0;
}
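Here, pthread_sigmask first sets SIGQUIT to UNBLOCK for the current thread, and sigaction then registers the signal handler that listens for SIGQUIT. sigaction() takes three arguments: the signal to listen for, the new action, and a location that receives the previous action. When the handler is invoked, it updates xc_trace_notifier, which is what wakes up the dump thread.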
static void *xc_trace_dumper(void *arg)
{
    JNIEnv         *env = NULL;
    uint64_t        data;
    uint64_t        trace_time;
    int             fd;
    struct timeval  tv;
    char            pathname[1024];
    jstring         j_pathname;
    (void)arg;
    pthread_detach(pthread_self());
    JavaVMAttachArgs attach_args = {
        .version = XC_JNI_VERSION,
        .name    = "xcrash_trace_dp",
        .group   = NULL
    };
    if(JNI_OK != (*xc_common_vm)->AttachCurrentThread(xc_common_vm, &env, &attach_args)) goto exit;
    while(1)
    {
        //block here, waiting for sigquit
        XCC_UTIL_TEMP_FAILURE_RETRY(read(xc_trace_notifier, &data, sizeof(data)));
        //check if process already crashed
        if(xc_common_native_crashed || xc_common_java_crashed) break;
        //trace time
        if(0 != gettimeofday(&tv, NULL)) break;
        trace_time = (uint64_t)(tv.tv_sec) * 1000 * 1000 + (uint64_t)tv.tv_usec;
        //Keep only one current trace.
        if(0 != xc_trace_logs_clean()) continue;
        //create and open log file
        if((fd = xc_common_open_trace_log(pathname, sizeof(pathname), trace_time)) < 0) continue;
        //write header info
        if(0 != xc_trace_write_header(fd, trace_time)) goto end;
        //write trace info from ART runtime
        if(0 != xcc_util_write_format(fd, XCC_UTIL_THREAD_SEP"Cmd line: %s\n", xc_common_process_name)) goto end;
        if(0 != xcc_util_write_str(fd, "Mode: ART DumpForSigQuit\n")) goto end;
        if(0 != xc_trace_load_symbols())
        {
            if(0 != xcc_util_write_str(fd, "Failed to load symbols.\n")) goto end;
            goto skip;
        }
        if(0 != xc_trace_check_address_valid())
        {
            if(0 != xcc_util_write_str(fd, "Failed to check runtime address.\n")) goto end;
            goto skip;
        }
        if(dup2(fd, STDERR_FILENO) < 0)
        {
            if(0 != xcc_util_write_str(fd, "Failed to duplicate FD.\n")) goto end;
            goto skip;
        }
        xc_trace_dump_status = XC_TRACE_DUMP_ON_GOING;
        if(sigsetjmp(jmpenv, 1) == 0)
        {
            if(xc_trace_is_lollipop) xc_trace_libart_dbg_suspend();
            xc_trace_libart_runtime_dump(*xc_trace_libart_runtime_instance, xc_trace_libcpp_cerr);
            if(xc_trace_is_lollipop) xc_trace_libart_dbg_resume();
        }
        else
        {
            fflush(NULL);
            XCD_LOG_WARN("longjmp to skip dumping trace\n");
        }
        dup2(xc_common_fd_null, STDERR_FILENO);
    skip:
        if(0 != xcc_util_write_str(fd, "\n"XCC_UTIL_THREAD_END"\n")) goto end;
        //write other info
        if(0 != xcc_util_record_logcat(fd, xc_common_process_id, xc_common_api_level, xc_trace_logcat_system_lines, xc_trace_logcat_events_lines, xc_trace_logcat_main_lines)) goto end;
        if(xc_trace_dump_fds)
            if(0 != xcc_util_record_fds(fd, xc_common_process_id)) goto end;
        if(xc_trace_dump_network_info)
            if(0 != xcc_util_record_network_info(fd, xc_common_process_id, xc_common_api_level)) goto end;
        if(0 != xcc_meminfo_record(fd, xc_common_process_id)) goto end;
    end:
        //close log file
        xc_common_close_trace_log(fd);
        //rethrow SIGQUIT to ART Signal Catcher
        if(xc_trace_rethrow && (XC_TRACE_DUMP_ART_CRASH != xc_trace_dump_status)) xc_trace_send_sigquit();
        xc_trace_dump_status = XC_TRACE_DUMP_END;
        //JNI callback
        //Do we need to implement an emergency buffer for disk exhausted?
        if(NULL == xc_trace_cb_method) continue;
        if(NULL == (j_pathname = (*env)->NewStringUTF(env, pathname))) continue;
        (*env)->CallStaticVoidMethod(env, xc_common_cb_class, xc_trace_cb_method, j_pathname, NULL);
        XC_JNI_IGNORE_PENDING_EXCEPTION();
        (*env)->DeleteLocalRef(env, j_pathname);
    }
    (*xc_common_vm)->DetachCurrentThread(xc_common_vm);
 exit:
    xc_trace_notifier = -1;
    close(xc_trace_notifier);
    return NULL;
}
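The newly created thread runs an infinite loop that blocks in read() on xc_trace_notifier. When the read returns, a SIGQUIT has arrived, which indicates that an ANR has occurred. The thread then writes the thread traces from the ART runtime into the log file and finally invokes the Java callback method saved in xc_trace_init_callback(), notifying the Java layer and completing the monitoring.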
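The shape of that loop, a dedicated thread parked on an eventfd, can be sketched without any JNI or ART details. Everything named here (`dumper_start`, `dumper_post`, `events_seen`, and the `DUMPER_STOP` sentinel) is an assumption of this sketch; xCrash itself detaches the thread and never stops it.

```c
#include <errno.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Sketch-only convention: posting this value asks the loop to exit. */
#define DUMPER_STOP (UINT64_C(1) << 32)

static int      trace_notifier = -1; /* plays the role of xc_trace_notifier */
static uint64_t events_seen    = 0;  /* read by the caller only after join  */

static void *dumper_loop(void *arg)
{
    uint64_t data;
    ssize_t n;
    (void)arg;
    for (;;) {
        /* block here until somebody posts to the eventfd */
        do {
            n = read(trace_notifier, &data, sizeof(data));
        } while (n < 0 && errno == EINTR);
        if (n != (ssize_t)sizeof(data)) break;
        /* placeholder for "dump the traces": count the ordinary events */
        events_seen += data & (DUMPER_STOP - 1);
        if (data >= DUMPER_STOP) break;
    }
    return NULL;
}

/* Create the eventfd first, then the thread that waits on it. */
static int dumper_start(pthread_t *thd)
{
    if ((trace_notifier = eventfd(0, EFD_CLOEXEC)) < 0) return -1;
    if (0 != pthread_create(thd, NULL, dumper_loop, NULL)) {
        close(trace_notifier);
        trace_notifier = -1;
        return -1;
    }
    return 0;
}

/* Add `value` to the eventfd counter; 1 means "one event". */
static int dumper_post(uint64_t value)
{
    return write(trace_notifier, &value, sizeof(value)) == (ssize_t)sizeof(value) ? 0 : -1;
}
```

Several posts that arrive before the thread wakes up are delivered as one read() with the summed counter, which is why the sketch adds `data` rather than incrementing by one.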
0x04. Native Crash
The flow for capturing native crashes is somewhat similar to the ANR flow on newer versions.
int xc_crash_init(JNIEnv *env,
                  int rethrow,
                  unsigned int logcat_system_lines,
                  unsigned int logcat_events_lines,
                  unsigned int logcat_main_lines,
                  int dump_elf_hash,
                  int dump_map,
                  int dump_fds,
                  int dump_network_info,
                  int dump_all_threads,
                  unsigned int dump_all_threads_count_max,
                  const char **dump_all_threads_whitelist,
                  size_t dump_all_threads_whitelist_len)
{
    xc_crash_prepared_fd = XCC_UTIL_TEMP_FAILURE_RETRY(open("/dev/null", O_RDWR));
    xc_crash_rethrow = rethrow;
    if(NULL == (xc_crash_emergency = calloc(XC_CRASH_EMERGENCY_BUF_LEN, 1))) return XCC_ERRNO_NOMEM;
    if(NULL == (xc_crash_dumper_pathname = xc_util_strdupcat(xc_common_app_lib_dir, "/"XCC_UTIL_XCRASH_DUMPER_FILENAME))) return XCC_ERRNO_NOMEM;
    //init the local unwinder for fallback mode
    xcc_unwind_init(xc_common_api_level);
    //init for JNI callback
    xc_crash_init_callback(env);
    //struct info passed to the dumper process
    memset(&xc_crash_spot, 0, sizeof(xcc_spot_t));
    xc_crash_spot.api_level = xc_common_api_level;
    xc_crash_spot.crash_pid = xc_common_process_id;
    xc_crash_spot.start_time = xc_common_start_time;
    xc_crash_spot.time_zone = xc_common_time_zone;
    xc_crash_spot.logcat_system_lines = logcat_system_lines;
    xc_crash_spot.logcat_events_lines = logcat_events_lines;
    xc_crash_spot.logcat_main_lines = logcat_main_lines;
    xc_crash_spot.dump_elf_hash = dump_elf_hash;
    xc_crash_spot.dump_map = dump_map;
    xc_crash_spot.dump_fds = dump_fds;
    xc_crash_spot.dump_network_info = dump_network_info;
    xc_crash_spot.dump_all_threads = dump_all_threads;
    xc_crash_spot.dump_all_threads_count_max = dump_all_threads_count_max;
    xc_crash_spot.os_version_len = strlen(xc_common_os_version);
    xc_crash_spot.kernel_version_len = strlen(xc_common_kernel_version);
    xc_crash_spot.abi_list_len = strlen(xc_common_abi_list);
    xc_crash_spot.manufacturer_len = strlen(xc_common_manufacturer);
    xc_crash_spot.brand_len = strlen(xc_common_brand);
    xc_crash_spot.model_len = strlen(xc_common_model);
    xc_crash_spot.build_fingerprint_len = strlen(xc_common_build_fingerprint);
    xc_crash_spot.app_id_len = strlen(xc_common_app_id);
    xc_crash_spot.app_version_len = strlen(xc_common_app_version);
    xc_crash_init_dump_all_threads_whitelist(dump_all_threads_whitelist, dump_all_threads_whitelist_len);
    //for clone and fork
#ifndef __i386__
    if(NULL == (xc_crash_child_stack = calloc(XC_CRASH_CHILD_STACK_LEN, 1))) return XCC_ERRNO_NOMEM;
    xc_crash_child_stack = (void *)(((uint8_t *)xc_crash_child_stack) + XC_CRASH_CHILD_STACK_LEN);
#else
    if(0 != pipe2(xc_crash_child_notifier, O_CLOEXEC)) return XCC_ERRNO_SYS;
#endif
    //register signal handler
    return xcc_signal_crash_register(xc_crash_signal_handler);
}
It first initializes the unwinder, which is used to obtain backtraces, then saves the Java callback method, and finally registers the signal handler.
int xcc_signal_crash_register(void (*handler)(int, siginfo_t *, void *))
{
    stack_t ss;
    if(NULL == (ss.ss_sp = calloc(1, XCC_SIGNAL_CRASH_STACK_SIZE))) return XCC_ERRNO_NOMEM;
    ss.ss_size = XCC_SIGNAL_CRASH_STACK_SIZE;
    ss.ss_flags = 0;
    if(0 != sigaltstack(&ss, NULL)) return XCC_ERRNO_SYS;
    struct sigaction act;
    memset(&act, 0, sizeof(act));
    sigfillset(&act.sa_mask);
    act.sa_sigaction = handler;
    act.sa_flags = SA_RESTART | SA_SIGINFO | SA_ONSTACK;
    size_t i;
    for(i = 0; i < sizeof(xcc_signal_crash_info) / sizeof(xcc_signal_crash_info[0]); i++)
        if(0 != sigaction(xcc_signal_crash_info[i].signum, &act, &(xcc_signal_crash_info[i].oldact)))
            return XCC_ERRNO_SYS;
    return 0;
}
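During registration, xcc_signal_crash_info is the list of signals to handle. I have annotated what each signal means in the code below.
static xcc_signal_crash_info_t xcc_signal_crash_info[] =
{
    {.signum = SIGABRT},  // raised by the process itself via abort() / kill() / tkill() / tgkill(), or sent by another process via kill() / tkill() / tgkill()
    {.signum = SIGBUS},   // access to an invalid physical address
    {.signum = SIGFPE},   // arithmetic error, e.g. integer division by zero
    {.signum = SIGILL},   // unrecognized CPU instruction
    {.signum = SIGSEGV},  // access to an invalid virtual memory address
    {.signum = SIGTRAP},  // breakpoint or trap instruction
    {.signum = SIGSYS},   // unrecognized system call
    {.signum = SIGSTKFLT} // coprocessor stack fault (not a general stack overflow; unused on most platforms)
};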
static void xc_crash_signal_handler(int sig, siginfo_t *si, void *uc)
{
    // ... omitted ...
    pid_t dumper_pid = xc_crash_fork(xc_crash_exec_dumper);
    // ... omitted ...
}
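Note that the signal is handled by launching a new process. Doing so sidesteps the async-signal-safe restrictions, the problem of exhausted virtual address space, and the problem of exhausted FDs. The new process can use ptrace() to suspend every thread in the crashed process, collect registers and backtraces not only for the crashing thread but also for all other threads in the process, and read memory more safely.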
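The "do the heavy work in a child process" idea can be sketched with plain fork() and waitpid(). This is a simplification, not xCrash's implementation: the real code goes through xc_crash_fork and xc_crash_exec_dumper, while this sketch neither executes a separate dumper binary nor uses ptrace(). `run_in_child` and `fake_dump` are hypothetical names.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run `work(arg)` in a forked child and return its exit code (0..255),
   or -1 if the child could not be started or did not exit normally. */
static int run_in_child(int (*work)(void *), void *arg)
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0) return -1;
    if (pid == 0) {
        /* Child: a fresh process, so the work here is not bound by the
           restrictions that apply inside the parent's signal handler. */
        _exit(work(arg) & 0xff);
    }
    /* Parent: wait for the child to finish, retrying on EINTR. */
    while (waitpid(pid, &status, 0) < 0)
        if (errno != EINTR) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Placeholder for the dump work: just report the code it was given. */
static int fake_dump(void *arg)
{
    return *(int *)arg;
}
```

The parent only forks and waits; everything that allocates memory, opens files, or walks other threads' state happens in the child.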
0x05. Notes
A few points stood out while I was reading the source:
- To guard against resource exhaustion, the files (FDs) are opened in advance during initialization
//create prepared FD for FD exhausted case
xc_common_open_prepared_fd(1);
xc_common_open_prepared_fd(0);
- There are two ways to listen for a signal
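1. sigwait: blocking. If several threads wait this way, there is no telling which one receives the signal, and once one thread has received it the others will not.
2. sigaction: installs a signal handler and does not block. However, Android sets SIGQUIT to BLOCKED by default, so we first have to set SIGQUIT to UNBLOCK with pthread_sigmask or sigprocmask and only then install the signal handler.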
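- After the signal has been handled, it has to be sent once more to the Signal Catcher thread; otherwise the runtime never receives it and the system's own ANR handling does not run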
int tid = getSignalCatcherThreadId(); //walk /proc/[pid]/task to find the tid of the Signal Catcher thread
tgkill(getpid(), tid, SIGQUIT);
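A lookup such as getSignalCatcherThreadId() could be sketched as follows, assuming the thread is identified by the name shown in /proc/self/task/<tid>/comm. The helper names are hypothetical, and tgkill is invoked through syscall() because not every libc provides a wrapper for it.

```c
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Read the name of thread `tid` of this process into `buf`. */
static int thread_name(pid_t tid, char *buf, size_t len)
{
    char path[64];
    FILE *f;

    snprintf(path, sizeof(path), "/proc/self/task/%d/comm", (int)tid);
    if (NULL == (f = fopen(path, "r"))) return -1;
    if (NULL == fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0'; /* strip the trailing newline */
    return 0;
}

/* Scan /proc/self/task and return the tid of the first thread whose name
   equals `name`, or -1 if there is none. */
static pid_t find_thread_by_name(const char *name)
{
    DIR *dir = opendir("/proc/self/task");
    struct dirent *ent;
    char buf[32];
    pid_t found = -1;

    if (NULL == dir) return -1;
    while (NULL != (ent = readdir(dir))) {
        pid_t tid = (pid_t)atoi(ent->d_name);
        if (tid <= 0) continue; /* skips "." and ".." */
        if (0 == thread_name(tid, buf, sizeof(buf)) && 0 == strcmp(buf, name)) {
            found = tid;
            break;
        }
    }
    closedir(dir);
    return found;
}

/* Send `sig` to one specific thread of this process. */
static int send_signal_to_thread(pid_t tid, int sig)
{
    return (int)syscall(SYS_tgkill, getpid(), tid, sig);
}
```

With these helpers, forwarding the signal would amount to calling `send_signal_to_thread(find_thread_by_name("Signal Catcher"), SIGQUIT)` after checking that the lookup did not return -1.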
0x06. Summary
Finally, I am leaving the mind map I put together here. Feel free to reach out and discuss.