NativeTombstoneManagerService & the Native Crash Flow, Based on Android V (15)


Preface

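Based on the Android 15 source code, this article analyzes what NativeTombstoneManagerService does, the dropbox deduplication (rate-limiting) rules, and the native crash flow, and then traces where dropbox files come from in the case of a native crash.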

NativeTombstoneManagerService

Startup

NativeTombstoneManagerService is started in the startCoreServices function of SystemServer. Its main logic has two parts:

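  • In onStart, it creates a NativeTombstoneManager, which watches the "/data/tombstones" directory for file changes (CREATE and MOVED_TO). When a change is detected, handleTombstone is called to process it;
  • In onBootPhase, it calls onSystemReady, which starts listening for user and package changes and calls handleTombstone to process the tombstone files that already exist on the system.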
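The event filtering in the first bullet can be sketched in plain Java. This is only an illustration, not the AOSP code: the class name TombstoneWatchSketch and the shouldHandle helper are hypothetical, and the constants simply reuse the numeric values of android.os.FileObserver.CREATE, MOVED_TO and DELETE so the sketch runs without the Android framework.

```java
// Sketch only: mirrors the event filtering described above, not the AOSP source.
final class TombstoneWatchSketch {
    // Same numeric values as android.os.FileObserver.CREATE / MOVED_TO / DELETE (inotify bits).
    static final int CREATE = 0x00000100;
    static final int MOVED_TO = 0x00000080;
    static final int DELETE = 0x00000200;

    // Mask a watcher on /data/tombstones would register for.
    static final int WATCH_MASK = CREATE | MOVED_TO;

    /** Returns true when an event should be forwarded to a handleTombstone-like callback. */
    static boolean shouldHandle(int event, String fileName) {
        // Extra flag bits may be set on an event, so test against the mask instead of using ==.
        return fileName != null && (event & WATCH_MASK) != 0;
    }

    public static void main(String[] args) {
        System.out.println(shouldHandle(CREATE, "tombstone_00"));      // new file written
        System.out.println(shouldHandle(MOVED_TO, "tombstone_00.pb")); // file renamed into the directory
        System.out.println(shouldHandle(DELETE, "tombstone_00"));      // ignored by this mask
    }
}
```

Watching MOVED_TO as well as CREATE matters when a writer creates a file under a temporary name and then renames it into place, because only the rename event then carries the final file name.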

The startup sequence diagram is shown below:

NativeTombstoneManagerService.png

Dropbox filtering rules

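Imagine an application that keeps crashing: should every one of those crashes be added to dropbox? Of course not. A filtering rule applies here, which the sequence diagram above also touches on. This section describes that rule.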

Prerequisites

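The filtering conditions live in the DropboxRateLimiter class. Several variables set when the class is initialized determine the thresholds used for filtering. They are summarized below; if the system does not override them, the default values apply: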

Variables and their default values:

mRateLimitBufferDuration
    // After RATE_LIMIT_ALLOWED_ENTRIES have been collected (for a single breakdown of
    // process/eventType) further entries will be rejected until RATE_LIMIT_BUFFER_DURATION has
    // elapsed, after which the current count for this breakdown will be reset.
    private static final long RATE_LIMIT_BUFFER_DURATION_DEFAULT = 10 * DateUtils.MINUTE_IN_MILLIS;
Once RATE_LIMIT_ALLOWED_ENTRIES entries have been collected for one process/eventType breakdown, further entries are rejected until RATE_LIMIT_BUFFER_DURATION (10 minutes) has elapsed, after which the count for that breakdown is reset.

mRateLimitBufferExpiryFactor
    // Indicated how many buffer durations to wait before the rate limit buffer will be cleared.
    // E.g. if set to 3 will wait 3xRATE_LIMIT_BUFFER_DURATION before clearing the buffer.
    private static final long RATE_LIMIT_BUFFER_EXPIRY_FACTOR_DEFAULT = 3;
The number of buffer durations to wait before the rate-limit buffer is cleared. With the default of 3, the system waits 3 × RATE_LIMIT_BUFFER_DURATION before clearing it.

mRateLimitAllowedEntries
    // The number of entries to keep per breakdown of process/eventType.
    private static final int RATE_LIMIT_ALLOWED_ENTRIES_DEFAULT = 6;
Within each RATE_LIMIT_BUFFER_DURATION, at most 6 entries are recorded for the same process/eventType.

mStrictRatelimitAllowedEntries
    // If a process is rate limited twice in a row we consider it crash-looping and rate limit it more aggressively.
    private static final int STRICT_RATE_LIMIT_ALLOWED_ENTRIES_DEFAULT = 1;
If a process is rate limited twice in a row, it is considered to be crash-looping and is subjected to a stricter rate limit (1 entry).

mStrictRateLimitBufferDuration
    private static final long STRICT_RATE_LIMIT_BUFFER_DURATION_DEFAULT = 20 * DateUtils.MINUTE_IN_MILLIS;
The buffer duration used under the strict policy (20 minutes).
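To make the arithmetic behind these defaults concrete, here is a minimal sketch. The class RateLimitDefaultsSketch and its helper methods are hypothetical names introduced for illustration; only the numeric defaults come from the constants listed above.

```java
import java.util.concurrent.TimeUnit;

// Sketch only: restates the defaults listed above and derives the values that follow from them.
final class RateLimitDefaultsSketch {
    static final long BUFFER_DURATION_MS = TimeUnit.MINUTES.toMillis(10);
    static final long BUFFER_EXPIRY_FACTOR = 3;
    static final int ALLOWED_ENTRIES = 6;
    static final int STRICT_ALLOWED_ENTRIES = 1;
    static final long STRICT_BUFFER_DURATION_MS = TimeUnit.MINUTES.toMillis(20);

    /** Age after which a record becomes eligible for cleanup: factor x buffer duration. */
    static long expiryMs() {
        return BUFFER_EXPIRY_FACTOR * BUFFER_DURATION_MS;
    }

    /** Window length for a breakdown, depending on whether it is treated as crash-looping. */
    static long bufferDurationMs(boolean crashLooping) {
        return crashLooping ? STRICT_BUFFER_DURATION_MS : BUFFER_DURATION_MS;
    }

    /** Entries accepted per window, depending on whether it is treated as crash-looping. */
    static int allowedEntries(boolean crashLooping) {
        return crashLooping ? STRICT_ALLOWED_ENTRIES : ALLOWED_ENTRIES;
    }

    public static void main(String[] args) {
        System.out.println("expiry (min): " + TimeUnit.MILLISECONDS.toMinutes(expiryMs()));
        System.out.println("normal: " + allowedEntries(false) + " entries per "
                + TimeUnit.MILLISECONDS.toMinutes(bufferDurationMs(false)) + " min");
        System.out.println("strict: " + allowedEntries(true) + " entry per "
                + TimeUnit.MILLISECONDS.toMinutes(bufferDurationMs(true)) + " min");
    }
}
```

With the defaults, a record expires after 30 minutes, a normal breakdown gets 6 entries per 10 minutes, and a crash-looping breakdown gets 1 entry per 20 minutes.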

Filtering rule

When a tombstone file is generated, handleTombstone is called back. While processing it, addTombstoneToDropBox filters the entry using the shouldRateLimit method of DropboxRateLimiter:

    /** Determines whether dropbox entries of a specific tag and process should be rate limited. */
    public RateLimitResult shouldRateLimit(String eventType, String processName) {
        // Rate-limit how often we're willing to do the heavy lifting to collect and record logs.
        final long now = mClock.uptimeMillis();
        synchronized (mErrorClusterRecords) {
            // Remove expired records if enough time has passed since the last cleanup.
            maybeRemoveExpiredRecords(now);

            ErrorRecord errRecord = mErrorClusterRecords.get(errorKey(eventType, processName));
            if (errRecord == null) {
                errRecord = new ErrorRecord(now, 1);
                mErrorClusterRecords.put(errorKey(eventType, processName), errRecord);
                return new RateLimitResult(false, 0);
            }

            final long timeSinceFirstError = now - errRecord.getStartTime();
            if (timeSinceFirstError > errRecord.getBufferDuration()) {
                final int errCount = recentlyDroppedCount(errRecord);
                errRecord.setStartTime(now);
                errRecord.setCount(1);

                // If this error happened exactly the next "rate limiting cycle" after the last
                // error and the previous cycle was rate limiting then increment the successive
                // rate limiting cycle counter. If a full "cycle" has passed since the last error
                // then this is no longer a continuous occurrence and will be rate limited normally.
                if (errCount > 0 && timeSinceFirstError < 2 * errRecord.getBufferDuration()) {
                    errRecord.incrementSuccessiveRateLimitCycles();
                } else {
                    errRecord.setSuccessiveRateLimitCycles(0);
                }

                return new RateLimitResult(false, errCount);
            }

            errRecord.incrementCount();
            if (errRecord.getCount() > errRecord.getAllowedEntries()) {
                return new RateLimitResult(true, recentlyDroppedCount(errRecord));
            }
        }
        return new RateLimitResult(false, 0);
    }

First we need to understand mErrorClusterRecords and the ErrorRecord class. mErrorClusterRecords is an ArrayMap whose key is "eventType + processName" and whose value is an ErrorRecord:

    @GuardedBy("mErrorClusterRecords")
    private final ArrayMap<String, ErrorRecord> mErrorClusterRecords = new ArrayMap<>();

    private class ErrorRecord {
        long mStartTime;
        int mCount;
        int mSuccessiveRateLimitCycles;

        ErrorRecord(long startTime, int count) {
            mStartTime = startTime;
            mCount = count;
            mSuccessiveRateLimitCycles = 0;
        }
        // ... (remaining members omitted from this excerpt)
    }

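Below, the shouldRateLimit rule is broken down case by case. On entry, the function first calls maybeRemoveExpiredRecords, which removes every ErrorRecord that is older than mRateLimitBufferExpiryFactor * mRateLimitBufferDuration (30 minutes by default):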

Condition and whether the entry is filtered:

Case 1: no existing record. Result: not filtered.
    ErrorRecord errRecord = mErrorClusterRecords.get(errorKey(eventType, processName));
    if (errRecord == null) {
        errRecord = new ErrorRecord(now, 1);
        mErrorClusterRecords.put(errorKey(eventType, processName), errRecord);
        return new RateLimitResult(false, 0);
    }
errRecord is null, which means this "eventType + processName" has produced no tombstone in the last 30 minutes (this includes never having produced one). A new ErrorRecord is created with the current time as its start time and a count of 1, it is added to mErrorClusterRecords, and shouldRateLimit returns false (not filtered).

Case 2: the record's buffer duration has elapsed. Result: not filtered.
    final long timeSinceFirstError = now - errRecord.getStartTime();
    if (timeSinceFirstError > errRecord.getBufferDuration()) {
        final int errCount = recentlyDroppedCount(errRecord);
        errRecord.setStartTime(now);
        errRecord.setCount(1);
        // If this error happened exactly the next "rate limiting cycle" after the last
        // error and the previous cycle was rate limiting then increment the successive
        // rate limiting cycle counter. If a full "cycle" has passed since the last error
        // then this is no longer a continuous occurrence and will be rate limited normally.
        if (errCount > 0 && timeSinceFirstError < 2 * errRecord.getBufferDuration()) {
            errRecord.incrementSuccessiveRateLimitCycles();
        } else {
            errRecord.setSuccessiveRateLimitCycles(0);
        }
        return new RateLimitResult(false, errCount);
    }
errRecord is not null, which means this "eventType + processName" has produced a tombstone within the last 30 minutes. The record's start time is subtracted from the current time. If that difference exceeds the buffer duration (10 or 20 minutes, depending on whether isRepeated is true), errCount is obtained from recentlyDroppedCount: it is 0 when the count is below getAllowedEntries (6 by default), otherwise it is errRecord.getCount() minus the allowed entries:
    private int recentlyDroppedCount(ErrorRecord errRecord) {
        if (errRecord == null || errRecord.getCount() < errRecord.getAllowedEntries()) return 0;
        return errRecord.getCount() - errRecord.getAllowedEntries();
    }
The record's start time and count are then reset. If errCount is greater than 0 and the time difference is less than twice the buffer duration, incrementSuccessiveRateLimitCycles increases mSuccessiveRateLimitCycles; otherwise mSuccessiveRateLimitCycles is reset to 0. shouldRateLimit returns false (not filtered).

Case 3: still inside the buffer duration and over the allowed count. Result: filtered.
    errRecord.incrementCount();
    if (errRecord.getCount() > errRecord.getAllowedEntries()) {
        return new RateLimitResult(true, recentlyDroppedCount(errRecord));
    }
If neither case above applies, the error has occurred before and the time since the record's start is within the buffer duration. The count is incremented, and if it now exceeds the allowed entries (6 by default), shouldRateLimit returns true (filtered). Otherwise the function falls through and returns false.
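The three cases can be exercised with a small standalone simulation. This is a simplified sketch rather than the AOSP class: RateLimiterSketch, Rec and countAllowed are hypothetical names, time is passed in explicitly instead of read from a clock, expired-record cleanup is left out, and REPEAT_THRESHOLD = 2 is an assumption based only on the "rate limited twice in a row" comment quoted earlier.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: a simplified re-implementation of the three cases described above.
final class RateLimiterSketch {
    static final long MINUTE_MS = 60_000L;
    static final long BUFFER_MS = 10 * MINUTE_MS;
    static final long STRICT_BUFFER_MS = 20 * MINUTE_MS;
    static final int ALLOWED = 6;
    static final int STRICT_ALLOWED = 1;
    static final int REPEAT_THRESHOLD = 2; // assumption: "rate limited twice in a row"

    static final class Rec {
        long start;
        int count = 1;
        int successiveCycles;
        Rec(long start) { this.start = start; }
        boolean repeated() { return successiveCycles >= REPEAT_THRESHOLD; }
        long buffer() { return repeated() ? STRICT_BUFFER_MS : BUFFER_MS; }
        int allowed() { return repeated() ? STRICT_ALLOWED : ALLOWED; }
    }

    private final Map<String, Rec> records = new HashMap<>();

    private static int dropped(Rec r) {
        return r.count < r.allowed() ? 0 : r.count - r.allowed();
    }

    /** Returns true when the entry arriving at nowMs should be dropped. */
    boolean shouldRateLimit(String eventType, String process, long nowMs) {
        String key = eventType + process;
        Rec r = records.get(key);
        if (r == null) {                       // case 1: first occurrence
            records.put(key, new Rec(nowMs));
            return false;
        }
        long since = nowMs - r.start;
        if (since > r.buffer()) {              // case 2: window elapsed, start a new one
            boolean consecutive = dropped(r) > 0 && since < 2 * r.buffer();
            r.start = nowMs;
            r.count = 1;
            r.successiveCycles = consecutive ? r.successiveCycles + 1 : 0;
            return false;
        }
        r.count++;                             // case 3: inside the window
        return r.count > r.allowed();
    }

    /** Simulates evenly spaced crashes starting at t=0 and counts how many are accepted. */
    static int countAllowed(int crashes, long spacingMs) {
        RateLimiterSketch limiter = new RateLimiterSketch();
        int allowed = 0;
        for (int i = 0; i < crashes; i++) {
            if (!limiter.shouldRateLimit("SYSTEM_TOMBSTONE", "com.example.demo", i * spacingMs)) {
                allowed++;
            }
        }
        return allowed;
    }

    public static void main(String[] args) {
        System.out.println(countAllowed(10, 1_000L)); // prints 6
    }
}
```

Ten crashes one second apart all fall inside the first 10-minute window, so only the first six are accepted and the remaining four are dropped.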

The other path

After AMS starts, it creates a NativeCrashListener, which also listens for native crashes and adds them to dropbox. This is the "other path" by which a native crash reaches dropbox.

linker

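  1. When a native Android program starts (for example, one that loads an APK's JNI libraries, or a command-line tool), its ELF file specifies the path of the dynamic linker (the linker). The PT_INTERP segment tells the kernel which interpreter to load. When the user runs the executable, the kernel parses its ELF header and finds that PT_INTERP points to /system/bin/linker64.
  2. The kernel maps the linker itself into memory (mmap) as the interpreter and jumps to the linker's entry point (__linker_init) rather than running the executable's main function directly. During this step, the default handler for fatal signals is installed for the native program.
  3. After the linker finishes loading and relocating the dynamic libraries, it jumps to the executable's real entry point, which leads to main.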
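The PT_INTERP lookup in step 1 can be illustrated with a short parser. This is a sketch under stated assumptions: it handles only little-endian ELF64 images, the class ElfInterpSketch and the fakeElf helper are hypothetical, and the demo runs against a synthetic in-memory image rather than a real binary. The field offsets follow the standard ELF64 header and program header layout.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Sketch only: finds the interpreter path the kernel would read from PT_INTERP.
final class ElfInterpSketch {
    static final int PT_INTERP = 3;

    /** Returns the PT_INTERP path of a little-endian ELF64 image, or null if there is none. */
    static String readInterpreter(byte[] image) {
        ByteBuffer buf = ByteBuffer.wrap(image).order(ByteOrder.LITTLE_ENDIAN);
        if (image.length < 64 || buf.get(0) != 0x7f || buf.get(1) != 'E'
                || buf.get(2) != 'L' || buf.get(3) != 'F') {
            throw new IllegalArgumentException("not an ELF image");
        }
        if (buf.get(4) != 2 || buf.get(5) != 1) {
            throw new IllegalArgumentException("only little-endian ELF64 is handled here");
        }
        long phoff = buf.getLong(32);                   // e_phoff
        int phentsize = buf.getShort(54) & 0xffff;      // e_phentsize
        int phnum = buf.getShort(56) & 0xffff;          // e_phnum
        for (int i = 0; i < phnum; i++) {
            int ph = (int) phoff + i * phentsize;
            if (buf.getInt(ph) != PT_INTERP) continue;  // p_type
            int offset = (int) buf.getLong(ph + 8);     // p_offset
            int size = (int) buf.getLong(ph + 32);      // p_filesz
            // The segment holds a NUL-terminated path; drop the terminator if present.
            int len = size > 0 && image[offset + size - 1] == 0 ? size - 1 : size;
            return new String(image, offset, len, StandardCharsets.US_ASCII);
        }
        return null;
    }

    /** Builds a minimal synthetic ELF64 image with one PT_INTERP header (demo data only). */
    static byte[] fakeElf(String interp) {
        byte[] path = (interp + "\0").getBytes(StandardCharsets.US_ASCII);
        ByteBuffer buf = ByteBuffer.allocate(64 + 56 + path.length).order(ByteOrder.LITTLE_ENDIAN);
        buf.put(0, (byte) 0x7f).put(1, (byte) 'E').put(2, (byte) 'L').put(3, (byte) 'F');
        buf.put(4, (byte) 2).put(5, (byte) 1);  // ELFCLASS64, ELFDATA2LSB
        buf.putLong(32, 64);                    // program headers start right after the ELF header
        buf.putShort(54, (short) 56);           // size of one ELF64 program header
        buf.putShort(56, (short) 1);            // one program header
        buf.putInt(64, PT_INTERP);
        buf.putLong(64 + 8, 120);               // path is stored right after the program header
        buf.putLong(64 + 32, path.length);
        System.arraycopy(path, 0, buf.array(), 120, path.length);
        return buf.array();
    }

    public static void main(String[] args) {
        System.out.println(readInterpreter(fakeElf("/system/bin/linker64")));
    }
}
```

On a real device the same information can be inspected with a tool such as readelf, which lists the program headers including the requested interpreter.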

debuggerd

  1. In its init function, the linker installs debuggerd_signal_handler as the default signal handler for the executable. When a fatal signal arrives, execution enters this handler.
  2. debuggerd_signal_handler first prints a summary of the fault via log_signal_summary:
919   919 F libc    : Fatal signal 11 (SIGSEGV), code 0 (SI_USER from pid 584, uid 0) in tid 919 (ndroid.settings), pid 919 (ndroid.settings)
  3. It then runs crash_dump via exec, which prints and collects the crash information:
    #define CRASH_DUMP_NAME "crash_dump64"
    #define CRASH_DUMP_PATH "/apex/com.android.runtime/bin/" CRASH_DUMP_NAME
    execle(CRASH_DUMP_PATH, CRASH_DUMP_NAME, main_tid, pseudothread_tid, debuggerd_dump_type,
           nullptr, nullptr);
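When triaging logs, it can help to extract the fields from the summary line printed by log_signal_summary. The sketch below does this with a regular expression that was written against the single sample line shown above, so it is an assumption that other variants of the line match it. FatalSignalLineSketch is a hypothetical helper, not part of Android.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: pulls signal, tid and pid out of a "Fatal signal" logcat line.
final class FatalSignalLineSketch {
    static final String SAMPLE = "919   919 F libc    : Fatal signal 11 (SIGSEGV), code 0 "
            + "(SI_USER from pid 584, uid 0) in tid 919 (ndroid.settings), pid 919 (ndroid.settings)";

    // Groups: 1 signal number, 2 signal name, 3 code, 4 code detail,
    //         5 tid, 6 thread name, 7 pid, 8 process name.
    private static final Pattern SUMMARY = Pattern.compile(
            "Fatal signal (\\d+) \\((\\w+)\\), code (-?\\d+) \\(([^)]*)\\)"
            + " in tid (\\d+) \\(([^)]*)\\), pid (\\d+) \\(([^)]*)\\)");

    final int signal;
    final String signalName;
    final int tid;
    final int pid;
    final String processName;

    private FatalSignalLineSketch(int signal, String signalName, int tid, int pid, String processName) {
        this.signal = signal;
        this.signalName = signalName;
        this.tid = tid;
        this.pid = pid;
        this.processName = processName;
    }

    /** Returns the parsed fields, or null when the line is not a fatal-signal summary. */
    static FatalSignalLineSketch parse(String logLine) {
        Matcher m = SUMMARY.matcher(logLine);
        if (!m.find()) return null;
        return new FatalSignalLineSketch(Integer.parseInt(m.group(1)), m.group(2),
                Integer.parseInt(m.group(5)), Integer.parseInt(m.group(7)), m.group(8));
    }

    public static void main(String[] args) {
        FatalSignalLineSketch s = parse(SAMPLE);
        System.out.println(s.signalName + " (" + s.signal + ") in pid " + s.pid + ", tid " + s.tid);
    }
}
```

In the sample line, "SI_USER from pid 584" indicates the signal was sent by another process, which is what you would expect when a crash is triggered manually with kill.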

crash_dump

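  1. crash_dump uses tombstoned_connect to open a socket connection (tombstoned_crash) to tombstoned.
  2. Once connected, it calls the tombstone debugging functions to print and save the crash stack trace (unwound with libunwindstack) and the register state.
  3. After the crash information has been collected, it sends a socket message to AMS to report which process hit a native crash.
  4. Finally, it notifies tombstoned over the socket that this crash dump is complete.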

AMS

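  1. In its systemReady function, AMS starts NativeCrashListener, which loops listening for events on the socket /data/system/ndebugsocket.
  2. When a native crash occurs, crash_dump (64-bit or 32-bit) collects the register state and stack traces and then notifies AMS over that socket.
  3. When AMS receives the socket message, it runs consumeNativeCrashData, which eventually calls handleApplicationCrashInner("native_crash") to add the entry to dropbox.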
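The article does not spell out what the message sent over this socket looks like. Purely as an illustration of how such a message could be framed, the sketch below assumes a hypothetical layout: a 4-byte big-endian pid, a 4-byte big-endian signal number, then the report text terminated by a NUL byte. The layout, the class NativeCrashMessageSketch and its methods are all assumptions; the actual format should be checked against crash_dump and NativeCrashListener in your source tree.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Sketch only: an ASSUMED framing (pid, signal, NUL-terminated report), not the verified protocol.
final class NativeCrashMessageSketch {
    final int pid;
    final int signal;
    final String report;

    NativeCrashMessageSketch(int pid, int signal, String report) {
        this.pid = pid;
        this.signal = signal;
        this.report = report;
    }

    /** Sender side: what a crash_dump-like client would write to the socket. */
    static byte[] encode(int pid, int signal, String report) {
        byte[] text = report.getBytes(StandardCharsets.UTF_8);
        // ByteBuffer is big-endian by default, i.e. network byte order.
        return ByteBuffer.allocate(8 + text.length + 1)
                .putInt(pid).putInt(signal).put(text).put((byte) 0).array();
    }

    /** Receiver side: what a consumeNativeCrashData-like reader would reconstruct. */
    static NativeCrashMessageSketch decode(byte[] message) {
        ByteBuffer buf = ByteBuffer.wrap(message);
        int pid = buf.getInt();
        int signal = buf.getInt();
        int start = buf.position();
        int end = start;
        while (end < message.length && message[end] != 0) end++; // stop at the NUL terminator
        String report = new String(message, start, end - start, StandardCharsets.UTF_8);
        return new NativeCrashMessageSketch(pid, signal, report);
    }

    public static void main(String[] args) {
        NativeCrashMessageSketch m = decode(encode(919, 11, "backtrace: ..."));
        System.out.println(m.pid + " " + m.signal + " " + m.report);
    }
}
```

A terminator-based framing lets the sender stream a report of unknown length, at the cost of the receiver having to scan for the end marker.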

tombstoned

  1. It belongs to the system group and is started by default at boot. Its main function listens on the tombstoned_crash socket and registers crash_accept_cb as the connection callback:
service tombstoned /system/bin/tombstoned
    user tombstoned
    group system

    socket tombstoned_crash seqpacket 0666 system system
    socket tombstoned_intercept seqpacket 0666 system system
    socket tombstoned_java_trace seqpacket 0666 system system
    writepid /dev/cpuset/system-background/tasks
  2. When crash_dump connects to tombstoned, crash_accept_cb is triggered and registers the completion callback crash_completed_cb:
217   217 I tombstoned: received crash request for pid 919
  3. When crash_dump finishes dumping, tombstoned receives the completion callback and writes the tombstone file:
217   217 E tombstoned: Tombstone written to: tombstone_28

The whole flow is shown below:

Native_Crash.png

Summary

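Based on the Android 15 source code, there are currently two paths by which a native crash is added to dropbox:

  • NativeTombstoneManager watches the tombstone directory for changes, and handleTombstone adds the tombstone files to dropbox
  • AMS listens for native crashes through NativeCrashListener and eventually adds them to dropbox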

To verify this, we start a demo app on an Android 15 emulator and crash it with kill -11 pid. Two files are generated in the tombstone directory:

emu64x:/data/tombstones # ls -al
total 772
drwxrwxr-x  2 system     system   4096 2025-05-28 14:52 .
drwxrwx--x 51 system     system   4096 2025-05-28 14:51 ..
-rw-rw-r--  1 tombstoned system 462008 2025-05-28 14:52 tombstone_00
-rw-rw-r--  1 tombstoned system 299220 2025-05-28 14:52 tombstone_00.pb

NativeTombstoneManager adds these two files to dropbox through handleTombstone:

-rw-------  1 system system 14273 2025-05-28 14:52 SYSTEM_TOMBSTONE@1748415174157.txt.gz
-rw-------  1 system system 56406 2025-05-28 14:52 SYSTEM_TOMBSTONE_PROTO_WITH_HEADERS@1748415174142.dat.gz

AMS, listening for native crashes through NativeCrashListener, adds the following file to dropbox:

-rw-------  1 system system 16306 2025-05-28 14:52 data_app_native_crash@1748415174124.txt
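The three dropbox file names above follow a visible pattern: a tag, an "@", a millisecond timestamp, then an extension. The sketch below splits a name along that pattern, which is inferred from these listings only. DropBoxNameSketch is a hypothetical helper and does not use the DropBoxManager API.

```java
import java.time.Instant;

// Sketch only: splits a dropbox file name of the form TAG@millis.ext[.gz].
final class DropBoxNameSketch {
    final String tag;
    final long timestampMs;
    final boolean compressed;

    private DropBoxNameSketch(String tag, long timestampMs, boolean compressed) {
        this.tag = tag;
        this.timestampMs = timestampMs;
        this.compressed = compressed;
    }

    static DropBoxNameSketch parse(String fileName) {
        int at = fileName.lastIndexOf('@');
        if (at < 0) throw new IllegalArgumentException("no '@' in " + fileName);
        String tag = fileName.substring(0, at);
        String rest = fileName.substring(at + 1);
        int dot = rest.indexOf('.');
        String millis = dot < 0 ? rest : rest.substring(0, dot);
        return new DropBoxNameSketch(tag, Long.parseLong(millis), fileName.endsWith(".gz"));
    }

    public static void main(String[] args) {
        String[] names = {
            "SYSTEM_TOMBSTONE@1748415174157.txt.gz",
            "SYSTEM_TOMBSTONE_PROTO_WITH_HEADERS@1748415174142.dat.gz",
            "data_app_native_crash@1748415174124.txt",
        };
        for (String name : names) {
            DropBoxNameSketch e = parse(name);
            System.out.println(e.tag + " at " + Instant.ofEpochMilli(e.timestampMs)
                    + (e.compressed ? " (gzip)" : ""));
        }
    }
}
```

Comparing the parsed timestamps shows that the data_app_native_crash entry (…124) was written slightly before the two SYSTEM_TOMBSTONE entries (…142 and …157), all for the same crash.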