NativeTombstoneManagerService & Native Crash Flow - Based on Android V (15)

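Preface

Based on the Android 15 source, this article analyzes what NativeTombstoneManagerService does, the dropbox de-duplication (rate-limiting) rules, and the native crash flow, and traces where dropbox files come from in the native crash case.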

NativeTombstoneManagerService

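Startup

NativeTombstoneManagerService is started in SystemServer's startCoreServices function. Its logic falls into two parts:

In onStart it creates a NativeTombstoneManager, which watches the "/data/tombstones" directory for file changes (CREATE and MOVED_TO) and calls handleTombstone whenever a change is detected;
In onBootPhase it calls onSystemReady, which registers for user and package changes and calls handleTombstone on the tombstone files already present on the system;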
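The watching step can be approximated outside the framework. The sketch below is not the service's code: it assumes the JDK's java.nio.file.WatchService in place of Android's FileObserver, a temporary directory in place of /data/tombstones, and a hypothetical handleTombstone that only records the file name. On Linux, WatchService reports both newly created files and files moved into the directory as ENTRY_CREATE, which roughly corresponds to the CREATE and MOVED_TO events mentioned above.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

final class TombstoneDirWatcher {
    /** Hypothetical stand-in for handleTombstone: it only records the file name. */
    static void handleTombstone(Path file, List<String> seen) {
        seen.add(file.getFileName().toString());
    }

    /** Waits up to timeoutMs for one batch of events and hands each new entry to handleTombstone. */
    static List<String> watchOnce(WatchService watcher, Path dir, long timeoutMs)
            throws InterruptedException {
        List<String> seen = new ArrayList<>();
        WatchKey key = watcher.poll(timeoutMs, TimeUnit.MILLISECONDS);
        if (key == null) {
            return seen; // nothing happened within the timeout
        }
        for (WatchEvent<?> event : key.pollEvents()) {
            if (event.kind() == StandardWatchEventKinds.ENTRY_CREATE) {
                handleTombstone(dir.resolve((Path) event.context()), seen);
            }
        }
        key.reset(); // re-arm the key so later events are delivered
        return seen;
    }

    /** Creates a temporary directory, drops one file into it and returns what the watcher saw. */
    static List<String> demo() {
        try {
            Path dir = Files.createTempDirectory("tombstones");
            try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
                dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
                Files.createFile(dir.resolve("tombstone_00"));
                return watchOnce(watcher, dir, 15_000);
            }
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

A real service would keep polling in a loop on its own thread; the single poll here just keeps the example short.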

The startup sequence diagram is as follows:

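Dropbox filtering rules

Suppose an app keeps crashing: should every one of those crashes be added to dropbox? Of course not. Filtering rules apply here, and the diagram above touches on them as well. They are described below.

Preconditions

The filtering logic lives in the DropboxRateLimiter class. A handful of default constants defined in this class determine the conditions used when filtering; they are listed below. Unless the system overrides them, the defaults apply: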

Variable	Default	Meaning
mRateLimitBufferDuration	RATE_LIMIT_BUFFER_DURATION_DEFAULT = 10 * DateUtils.MINUTE_IN_MILLIS	After RATE_LIMIT_ALLOWED_ENTRIES entries have been collected for a single process/eventType breakdown, further entries are rejected until RATE_LIMIT_BUFFER_DURATION has elapsed, after which the count for that breakdown is reset.
mRateLimitBufferExpiryFactor	RATE_LIMIT_BUFFER_EXPIRY_FACTOR_DEFAULT = 3	How many buffer durations to wait before the rate limit buffer is cleared. For example, a value of 3 waits 3 x RATE_LIMIT_BUFFER_DURATION before clearing the buffer.
mRateLimitAllowedEntries	RATE_LIMIT_ALLOWED_ENTRIES_DEFAULT = 6	The number of entries to keep per process/eventType breakdown: within each buffer duration, the same process/eventType is recorded at most 6 times.
mStrictRatelimitAllowedEntries	STRICT_RATE_LIMIT_ALLOWED_ENTRIES_DEFAULT = 1	If a process is rate limited twice in a row it is considered crash-looping and is rate limited more aggressively; this is the stricter entry limit.
mStrictRateLimitBufferDuration	STRICT_RATE_LIMIT_BUFFER_DURATION_DEFAULT = 20 * DateUtils.MINUTE_IN_MILLIS	The buffer duration used in place of the 10-minute default once the stricter limit applies.
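To make the arithmetic behind these defaults concrete, here is a minimal sketch. The class and method names are hypothetical, and removeExpired is a simplified stand-in for the cleanup the rate limiter performs, not a copy of it; only the numeric values come from the defaults listed above.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

final class RateLimitDefaults {
    static final long MINUTE_MS = 60_000L;
    static final long BUFFER_DURATION_MS = 10 * MINUTE_MS;        // normal window
    static final long BUFFER_EXPIRY_FACTOR = 3;                   // windows to wait before clearing
    static final int ALLOWED_ENTRIES = 6;                         // entries per window
    static final int STRICT_ALLOWED_ENTRIES = 1;                  // entries per window when crash-looping
    static final long STRICT_BUFFER_DURATION_MS = 20 * MINUTE_MS; // window when crash-looping

    /** Age after which a record is cleared: 3 x 10 minutes = 30 minutes. */
    static long expiryMs() {
        return BUFFER_EXPIRY_FACTOR * BUFFER_DURATION_MS;
    }

    /** Removes every record whose start time is older than expiryMs(); returns how many were removed. */
    static int removeExpired(Map<String, Long> startTimes, long now) {
        int removed = 0;
        Iterator<Map.Entry<String, Long>> it = startTimes.entrySet().iterator();
        while (it.hasNext()) {
            if (now - it.next().getValue() > expiryMs()) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        Map<String, Long> startTimes = new HashMap<>();
        startTimes.put("old", 0L);
        startTimes.put("recent", 25 * MINUTE_MS);
        System.out.println(removeExpired(startTimes, 31 * MINUTE_MS)); // prints 1
        System.out.println(startTimes.keySet());                       // prints [recent]
    }
}
```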

Filtering rules

When a tombstone file is generated, handleTombstone is called back. While addTombstoneToDropBox processes the file, it filters using the shouldRateLimit method of DropboxRateLimiter:

    /** Determines whether dropbox entries of a specific tag and process should be rate limited. */
    public RateLimitResult shouldRateLimit(String eventType, String processName) {
        // Rate-limit how often we're willing to do the heavy lifting to collect and record logs.
        final long now = mClock.uptimeMillis();
        synchronized (mErrorClusterRecords) {
            // Remove expired records if enough time has passed since the last cleanup.
            maybeRemoveExpiredRecords(now);

            ErrorRecord errRecord = mErrorClusterRecords.get(errorKey(eventType, processName));
            if (errRecord == null) {
                errRecord = new ErrorRecord(now, 1);
                mErrorClusterRecords.put(errorKey(eventType, processName), errRecord);
                return new RateLimitResult(false, 0);
            }

            final long timeSinceFirstError = now - errRecord.getStartTime();
            if (timeSinceFirstError > errRecord.getBufferDuration()) {
                final int errCount = recentlyDroppedCount(errRecord);
                errRecord.setStartTime(now);
                errRecord.setCount(1);

                // If this error happened exactly the next "rate limiting cycle" after the last
                // error and the previous cycle was rate limiting then increment the successive
                // rate limiting cycle counter. If a full "cycle" has passed since the last error
                // then this is no longer a continuous occurrence and will be rate limited normally.
                if (errCount > 0 && timeSinceFirstError < 2 * errRecord.getBufferDuration()) {
                    errRecord.incrementSuccessiveRateLimitCycles();
                } else {
                    errRecord.setSuccessiveRateLimitCycles(0);
                }

                return new RateLimitResult(false, errCount);
            }

            errRecord.incrementCount();
            if (errRecord.getCount() > errRecord.getAllowedEntries()) {
                return new RateLimitResult(true, recentlyDroppedCount(errRecord));
            }
        }
        return new RateLimitResult(false, 0);
    }

To follow this we first need to understand mErrorClusterRecords and the ErrorRecord class. mErrorClusterRecords is an ArrayMap whose key is "eventType + processName" and whose value is an ErrorRecord:

    @GuardedBy("mErrorClusterRecords")
    private final ArrayMap<String, ErrorRecord> mErrorClusterRecords = new ArrayMap<>();
    private class ErrorRecord {
        long mStartTime;
        int mCount;
        int mSuccessiveRateLimitCycles;

        ErrorRecord(long startTime, int count) {
            mStartTime = startTime;
            mCount = count;
            mSuccessiveRateLimitCycles = 0;
        }

The following summarizes the shouldRateLimit rules case by case. On entry, the function first calls maybeRemoveExpiredRecords, which removes every ErrorRecord older than mRateLimitBufferExpiryFactor * mRateLimitBufferDuration (30 minutes):

Condition 1:

    ErrorRecord errRecord = mErrorClusterRecords.get(errorKey(eventType, processName));
    if (errRecord == null) {
        errRecord = new ErrorRecord(now, 1);
        mErrorClusterRecords.put(errorKey(eventType, processName), errRecord);
        return new RateLimitResult(false, 0);
    }

Filtered? No. errRecord is null, which means this "eventType + processName" has not produced a tombstone in the last 30 minutes (which includes never having produced one). An ErrorRecord is created with the current time as its start time and a count of 1, it is added to mErrorClusterRecords, and shouldRateLimit returns false.

Condition 2:

    final long timeSinceFirstError = now - errRecord.getStartTime();
    if (timeSinceFirstError > errRecord.getBufferDuration()) {
        final int errCount = recentlyDroppedCount(errRecord);
        errRecord.setStartTime(now);
        errRecord.setCount(1);

        // If this error happened exactly the next "rate limiting cycle" after the last
        // error and the previous cycle was rate limiting then increment the successive
        // rate limiting cycle counter. If a full "cycle" has passed since the last error
        // then this is no longer a continuous occurrence and will be rate limited normally.
        if (errCount > 0 && timeSinceFirstError < 2 * errRecord.getBufferDuration()) {
            errRecord.incrementSuccessiveRateLimitCycles();
        } else {
            errRecord.setSuccessiveRateLimitCycles(0);
        }

        return new RateLimitResult(false, errCount);
    }

Filtered? No. errRecord is not null, so this "eventType + processName" has produced a tombstone within the last 30 minutes. The code takes the record's start time (the first error of the current window) and computes the difference from the current time. If more than BufferDuration has passed (10 minutes, or 20 minutes when isRepeated is true), recentlyDroppedCount supplies errCount: 0 if the count is below getAllowedEntries (6 by default), otherwise errRecord.getCount() - getAllowedEntries().

    private int recentlyDroppedCount(ErrorRecord errRecord) {
        if (errRecord == null || errRecord.getCount() < errRecord.getAllowedEntries()) return 0;
        return errRecord.getCount() - errRecord.getAllowedEntries();
    }

The start time is reset to now and the count to 1. If errCount is greater than 0 and the time difference is less than twice BufferDuration, incrementSuccessiveRateLimitCycles increases mSuccessiveRateLimitCycles; otherwise mSuccessiveRateLimitCycles is reset to 0. shouldRateLimit returns false.

Condition 3:

    errRecord.incrementCount();
    if (errRecord.getCount() > errRecord.getAllowedEntries()) {
        return new RateLimitResult(true, recentlyDroppedCount(errRecord));
    }

Filtered? Yes. Neither of the conditions above matched: the error has occurred before and the time since the window started is within BufferDuration. The count is incremented, and if it now exceeds AllowedEntries (6 by default), shouldRateLimit returns true. If the count is still within the limit, the function falls through and returns false.
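To see these rules in action, the sketch below re-implements the decision logic in plain Java with an explicit clock value, so a crash loop can be simulated without a device. It is a simplified illustration rather than the framework class: the class name and the plain HashMap are ours, expiry cleanup is left out, the dropped count is not returned, and the threshold of two successive rate-limited cycles for switching to the strict limits is an assumption based on the "rate limited twice in a row" comment.

```java
import java.util.HashMap;
import java.util.Map;

final class RateLimiterSketch {
    static final long MINUTE_MS = 60_000L;
    static final long BUFFER_DURATION_MS = 10 * MINUTE_MS;
    static final long STRICT_BUFFER_DURATION_MS = 20 * MINUTE_MS;
    static final int ALLOWED_ENTRIES = 6;
    static final int STRICT_ALLOWED_ENTRIES = 1;
    // Assumption: two successive rate-limited cycles switch a record to the strict limits.
    static final int STRICT_REPEAT_THRESHOLD = 2;

    static final class Record {
        long startTime;
        int count = 1;
        int successiveCycles;
        Record(long startTime) { this.startTime = startTime; }
        boolean isRepeated() { return successiveCycles >= STRICT_REPEAT_THRESHOLD; }
        int allowedEntries() { return isRepeated() ? STRICT_ALLOWED_ENTRIES : ALLOWED_ENTRIES; }
        long bufferDuration() { return isRepeated() ? STRICT_BUFFER_DURATION_MS : BUFFER_DURATION_MS; }
        int droppedCount() { return Math.max(0, count - allowedEntries()); }
    }

    private final Map<String, Record> records = new HashMap<>();

    /** Returns true when the entry should be dropped; now is a timestamp in milliseconds. */
    boolean shouldRateLimit(String eventType, String processName, long now) {
        String key = eventType + processName;
        Record r = records.get(key);
        if (r == null) {                       // condition 1: first error for this key
            records.put(key, new Record(now));
            return false;
        }
        long sinceFirst = now - r.startTime;
        if (sinceFirst > r.bufferDuration()) { // condition 2: window elapsed, start a new one
            boolean consecutive = r.droppedCount() > 0 && sinceFirst < 2 * r.bufferDuration();
            r.startTime = now;
            r.count = 1;
            r.successiveCycles = consecutive ? r.successiveCycles + 1 : 0;
            return false;
        }
        r.count++;                             // condition 3: still inside the window
        return r.count > r.allowedEntries();
    }

    /** Simulates one crash per minute for ten minutes and counts the accepted entries. */
    static int acceptedInFirstTenMinutes() {
        RateLimiterSketch limiter = new RateLimiterSketch();
        int accepted = 0;
        for (int minute = 0; minute < 10; minute++) {
            if (!limiter.shouldRateLimit("SYSTEM_TOMBSTONE", "com.example.demo", minute * MINUTE_MS)) {
                accepted++;
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        System.out.println(acceptedInFirstTenMinutes()); // prints 6
    }
}
```

With one crash per minute, the crashes at minutes 0 to 5 are accepted and those at minutes 6 to 9 are dropped. The first crash after the 10-minute window has passed is accepted again and opens a new window.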

The other path

As we know, AMS creates a NativeCrashListener after it starts. It also listens for native crashes and adds them to dropbox; this is the "other path" by which native crashes reach dropbox.

linker

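When an Android native program (such as a JNI library in an APK or a command-line tool) starts, its ELF file specifies the path of the dynamic linker: the PT_INTERP segment tells the kernel which interpreter to load. When a user runs the executable, the kernel parses its ELF headers and finds that PT_INTERP points to /system/bin/linker64.
The kernel maps the linker itself into memory (mmap) as the interpreter and jumps to the linker's entry point (__linker_init) rather than running the executable's main function directly. During this step a default crash signal handler is installed for the native program.
After the linker has loaded and relocated the shared libraries, it jumps back to the executable's real entry point, from which the main function is eventually reached.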
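As an illustration of the PT_INTERP lookup, the following sketch reads the interpreter path from an ELF image the way a loader would locate it: walk the program headers and return the contents of the segment whose type is PT_INTERP. It handles only little-endian ELF64, and sampleImage builds a tiny synthetic image purely for the demonstration; both method names are hypothetical, and the field offsets follow the standard ELF64 layout.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class ElfInterp {
    static final int PT_INTERP = 3;
    static final int ELF_MAGIC_LE = 0x464C457F; // bytes 0x7F 'E' 'L' 'F' read as a little-endian int

    /** Returns the PT_INTERP path of a little-endian ELF64 image, or null if it has none. */
    static String interpreterOf(byte[] elf) {
        ByteBuffer buf = ByteBuffer.wrap(elf).order(ByteOrder.LITTLE_ENDIAN);
        if (elf.length < 64 || buf.getInt(0) != ELF_MAGIC_LE) {
            throw new IllegalArgumentException("not an ELF file");
        }
        if (elf[4] != 2 || elf[5] != 1) {
            throw new IllegalArgumentException("only little-endian ELF64 is handled");
        }
        long phoff = buf.getLong(0x20);              // e_phoff: start of the program headers
        int phentsize = buf.getShort(0x36) & 0xFFFF; // e_phentsize
        int phnum = buf.getShort(0x38) & 0xFFFF;     // e_phnum
        for (int i = 0; i < phnum; i++) {
            int ph = (int) (phoff + (long) i * phentsize);
            if (buf.getInt(ph) != PT_INTERP) {       // p_type
                continue;
            }
            int offset = (int) buf.getLong(ph + 8);  // p_offset
            int len = (int) buf.getLong(ph + 32);    // p_filesz
            while (len > 0 && elf[offset + len - 1] == 0) {
                len--;                               // drop the trailing NUL terminator
            }
            return new String(elf, offset, len, StandardCharsets.US_ASCII);
        }
        return null;
    }

    /** Builds a minimal synthetic image: ELF header, one PT_INTERP program header, then the path. */
    static byte[] sampleImage(String interp) {
        byte[] path = (interp + "\0").getBytes(StandardCharsets.US_ASCII);
        ByteBuffer buf = ByteBuffer.allocate(64 + 56 + path.length).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(0, ELF_MAGIC_LE);
        buf.put(4, (byte) 2);               // ELFCLASS64
        buf.put(5, (byte) 1);               // ELFDATA2LSB
        buf.putLong(0x20, 64);              // program headers follow the 64-byte header
        buf.putShort(0x36, (short) 56);     // size of one ELF64 program header
        buf.putShort(0x38, (short) 1);      // a single program header
        buf.putInt(64, PT_INTERP);
        buf.putLong(64 + 8, 120);           // the path starts right after the program header
        buf.putLong(64 + 32, path.length);
        byte[] out = buf.array();
        System.arraycopy(path, 0, out, 120, path.length);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(interpreterOf(sampleImage("/system/bin/linker64")));
    }
}
```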

debuggerd

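In its init function the linker installs debuggerd_signal_handler as the executable's default signal handler; when a fatal signal arrives, execution enters this handler.
debuggerd_signal_handler first prints a summary of the fault through log_signal_summary: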

919   919 F libc    : Fatal signal 11 (SIGSEGV), code 0 (SI_USER from pid 584, uid 0) in tid 919 (ndroid.settings), pid 919 (ndroid.settings)

It then runs crash_dump via exec to print and collect the crash information:

#define CRASH_DUMP_NAME "crash_dump64"
#define CRASH_DUMP_PATH "/apex/com.android.runtime/bin/" CRASH_DUMP_NAME
    execle(CRASH_DUMP_PATH, CRASH_DUMP_NAME, main_tid, pseudothread_tid, debuggerd_dump_type,
           nullptr, nullptr);

crash_dump

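crash_dump connects to tombstoned over a socket (tombstoned_crash) through tombstoned_connect.
Once connected, it calls the tombstone debug functions to print and save the stack trace (unwound with libunwindstack) and the register information.
After collecting the crash information, it sends a socket message to AMS to report that a process has hit a native crash.
Finally, it tells tombstoned over the socket that this crash dump has been completed.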

AMS

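AMS starts NativeCrashListener in its systemReady function; the listener loops on the socket (/data/system/ndebugsocket), waiting for events.
When a native crash occurs, crash_dump (64-bit or 32-bit) notifies AMS over this socket after it has collected the register information and stack traces.
On receiving the socket data, AMS runs consumeNativeCrashData, which ultimately calls handleApplicationCrashInner("native_crash") to add the entry to dropbox.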
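The wire format of this notification is not covered in this article. Purely as an illustration of what consuming such a message involves, the sketch below assumes a payload made of two 32-bit big-endian integers (pid, then signal number) followed by the crash report text and a terminating NUL byte. That layout, the class, and its methods are assumptions for the example and should be checked against NativeCrashListener before being relied on.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class NativeCrashPayload {
    final int pid;
    final int signal;
    final String report;

    NativeCrashPayload(int pid, int signal, String report) {
        this.pid = pid;
        this.signal = signal;
        this.report = report;
    }

    /** Parses the assumed layout: int32 pid, int32 signal (big-endian), report text, trailing NUL. */
    static NativeCrashPayload parse(byte[] data) {
        if (data.length < 8) {
            throw new IllegalArgumentException("payload shorter than the 8-byte header");
        }
        ByteBuffer buf = ByteBuffer.wrap(data); // ByteBuffer reads big-endian by default
        int pid = buf.getInt();
        int signal = buf.getInt();
        int end = data.length;
        if (end > 8 && data[end - 1] == 0) {
            end--;                              // drop the assumed terminator
        }
        String report = new String(data, 8, end - 8, StandardCharsets.UTF_8);
        return new NativeCrashPayload(pid, signal, report);
    }

    /** Builds a payload in the same assumed layout, for testing the parser. */
    static byte[] sample(int pid, int signal, String report) {
        byte[] text = report.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(8 + text.length + 1)
                .putInt(pid)
                .putInt(signal)
                .put(text)
                .put((byte) 0)
                .array();
    }

    public static void main(String[] args) {
        NativeCrashPayload p = parse(sample(919, 11, "Fatal signal 11 (SIGSEGV)"));
        System.out.println(p.pid + " " + p.signal + " " + p.report);
    }
}
```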

tombstoned

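tombstoned belongs to the system group and starts by default at boot. Its main function sets up a listener on the tombstoned_crash socket, with crash_accept_cb as the connection callback: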

service tombstoned /system/bin/tombstoned
    user tombstoned
    group system

    socket tombstoned_crash seqpacket 0666 system system
    socket tombstoned_intercept seqpacket 0666 system system
    socket tombstoned_java_trace seqpacket 0666 system system
    writepid /dev/cpuset/system-background/tasks

When crash_dump connects to tombstoned, the crash_accept_cb callback fires and registers crash_completed_cb as the completion callback:

217   217 I tombstoned: received crash request for pid 919

When crash_dump finishes dumping, tombstoned receives the completion callback and writes the tombstone file:

217   217 E tombstoned: Tombstone written to: tombstone_28

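The overall flow is as follows:

Summary

Based on the Android 15 source, there are currently two paths by which a native crash is added to dropbox:

NativeTombstoneManager watches the tombstone directory for changes, and handleTombstone adds the tombstone file to dropbox
AMS listens for native crashes through NativeCrashListener and ultimately adds them to dropbox

To verify this, we start a demo app on an Android 15 emulator and crash it with kill -11 pid. Two files are generated under the tombstone directory: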

emu64x:/data/tombstones # ls -al
total 772
drwxrwxr-x  2 system     system   4096 2025-05-28 14:52 .
drwxrwx--x 51 system     system   4096 2025-05-28 14:51 ..
-rw-rw-r--  1 tombstoned system 462008 2025-05-28 14:52 tombstone_00
-rw-rw-r--  1 tombstoned system 299220 2025-05-28 14:52 tombstone_00.pb

NativeTombstoneManager adds these two files to dropbox through handleTombstone:

-rw-------  1 system system 14273 2025-05-28 14:52 SYSTEM_TOMBSTONE@1748415174157.txt.gz
-rw-------  1 system system 56406 2025-05-28 14:52 SYSTEM_TOMBSTONE_PROTO_WITH_HEADERS@1748415174142.dat.gz

AMS, listening for native crashes through NativeCrashListener, adds the following file to dropbox:

-rw-------  1 system system 16306 2025-05-28 14:52 data_app_native_crash@1748415174124.txt
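The three file names above share the shape tag@timestamp.extension, and the numbers appear to be millisecond timestamps consistent with the 14:52 listing time. The sketch below splits such names and sorts them by timestamp, which makes the ordering of the two paths visible. The class is a hypothetical helper written for this article, not part of the dropbox implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class DropBoxEntryName {
    final String tag;
    final long timestampMs;
    final String extension;

    DropBoxEntryName(String tag, long timestampMs, String extension) {
        this.tag = tag;
        this.timestampMs = timestampMs;
        this.extension = extension;
    }

    /** Splits "TAG@1748415174157.txt.gz" into tag, timestamp and extension. */
    static DropBoxEntryName parse(String fileName) {
        int at = fileName.lastIndexOf('@');
        if (at < 0) {
            throw new IllegalArgumentException("no '@' in " + fileName);
        }
        int dot = fileName.indexOf('.', at);
        String digits = dot < 0 ? fileName.substring(at + 1) : fileName.substring(at + 1, dot);
        String ext = dot < 0 ? "" : fileName.substring(dot + 1);
        return new DropBoxEntryName(fileName.substring(0, at), Long.parseLong(digits), ext);
    }

    /** Returns the tags of the given file names ordered by their timestamps. */
    static List<String> tagsInTimeOrder(List<String> fileNames) {
        List<DropBoxEntryName> entries = new ArrayList<>();
        for (String name : fileNames) {
            entries.add(parse(name));
        }
        entries.sort(Comparator.comparingLong(e -> e.timestampMs));
        List<String> tags = new ArrayList<>();
        for (DropBoxEntryName e : entries) {
            tags.add(e.tag);
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(tagsInTimeOrder(List.of(
                "SYSTEM_TOMBSTONE@1748415174157.txt.gz",
                "SYSTEM_TOMBSTONE_PROTO_WITH_HEADERS@1748415174142.dat.gz",
                "data_app_native_crash@1748415174124.txt")));
        // prints [data_app_native_crash, SYSTEM_TOMBSTONE_PROTO_WITH_HEADERS, SYSTEM_TOMBSTONE]
    }
}
```

In this run the entry written through NativeCrashListener carries the earliest timestamp, followed 18 ms later by the proto tombstone entry and then the text tombstone entry.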