Preface
Based on the Android 15 source code, this article analyzes the role of NativeTombstoneManagerService, the dropbox deduplication (rate-limiting) rules, and the native crash flow, and summarizes where dropbox files come from along the native crash path.
NativeTombstoneManagerService
Startup
NativeTombstoneManagerService is started in SystemServer's startCoreServices function. Its main logic covers two aspects:
- In the onStart method it creates a NativeTombstoneManager, which watches the "/data/tombstones" directory for file changes (CREATE and MOVED_TO) and calls the handleTombstone function when a change is observed;
- In the onBootPhase function it calls onSystemReady, which registers listeners for user and package changes and calls handleTombstone to process the tombstone files already present on the system;
The startup sequence diagram is as follows:
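The Android FileObserver used in onStart is device-only, so as a rough, runnable analogue of the same watch-directory-then-handle pattern, here is a minimal sketch using java.nio's WatchService (inotify-backed on Linux). The directory, file name, and helper names are illustrative, and only CREATE is modeled:

```java
import java.nio.file.*;
import java.util.concurrent.TimeUnit;

public class TombstoneWatcherSketch {
    /**
     * Watch a directory for newly created files and return the first file name
     * observed, or null on timeout. On a device the directory would be
     * /data/tombstones and the callback would be handleTombstone().
     */
    static String awaitFirstCreate(Path dir, Path toCreate) throws Exception {
        try (WatchService ws = dir.getFileSystem().newWatchService()) {
            // ENTRY_CREATE roughly plays the role of FileObserver's CREATE/MOVED_TO.
            dir.register(ws, StandardWatchEventKinds.ENTRY_CREATE);
            Files.createFile(toCreate); // stand-in for tombstoned writing a tombstone
            WatchKey key = ws.poll(10, TimeUnit.SECONDS);
            if (key == null) return null;
            for (WatchEvent<?> ev : key.pollEvents()) {
                return ev.context().toString(); // relative name of the new file
            }
            return null;
        }
    }

    static String demo() throws Exception {
        Path dir = Files.createTempDirectory("tombstones");
        return awaitFirstCreate(dir, dir.resolve("tombstone_00"));
    }

    public static void main(String[] args) throws Exception {
        System.out.println("observed: " + demo());
    }
}
```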
Dropbox filtering rules
Imagine an app that keeps crashing: should every occurrence be added to dropbox? Of course not. A rate-limiting rule applies here (it also appeared in the flow diagram above); let's walk through it.
Preconditions
The filtering logic lives in the DropboxRateLimiter class. A few static constants set at initialization determine the filtering thresholds; they are summarized below, with the defaults applying whenever the system does not override them:
| Field | Default | Meaning |
|---|---|---|
| mRateLimitBufferDuration | `10 * DateUtils.MINUTE_IN_MILLIS` (10 min) | After RATE_LIMIT_ALLOWED_ENTRIES entries have been collected for a single process/eventType breakdown, further entries are rejected until RATE_LIMIT_BUFFER_DURATION has elapsed, after which the count for that breakdown is reset |
| mRateLimitBufferExpiryFactor | 3 | How many buffer durations to wait before the rate-limit buffer is cleared; e.g. a value of 3 means waiting 3 × RATE_LIMIT_BUFFER_DURATION |
| mRateLimitAllowedEntries | 6 | The number of entries kept per process/eventType breakdown, i.e. at most 6 records per buffer duration |
| mStrictRatelimitAllowedEntries | 1 | If a process is rate limited twice in a row it is considered crash-looping and rate limited more aggressively |
| mStrictRateLimitBufferDuration | `20 * DateUtils.MINUTE_IN_MILLIS` (20 min) | The buffer duration used in strict (crash-looping) mode |
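A quick arithmetic check of how these defaults combine (the class name here is mine; the constants mirror the defaults quoted above): a record expires once mRateLimitBufferExpiryFactor × mRateLimitBufferDuration = 3 × 10 min = 30 min has passed.

```java
public class RateLimitDefaults {
    static final long MINUTE_IN_MILLIS = 60_000L; // mirrors DateUtils.MINUTE_IN_MILLIS

    // Defaults quoted from DropboxRateLimiter (Android 15).
    static final long RATE_LIMIT_BUFFER_DURATION = 10 * MINUTE_IN_MILLIS;
    static final long RATE_LIMIT_BUFFER_EXPIRY_FACTOR = 3;
    static final int RATE_LIMIT_ALLOWED_ENTRIES = 6;
    static final long STRICT_RATE_LIMIT_BUFFER_DURATION = 20 * MINUTE_IN_MILLIS;

    /** Age past which maybeRemoveExpiredRecords() drops an ErrorRecord. */
    static long recordExpiryMillis() {
        return RATE_LIMIT_BUFFER_EXPIRY_FACTOR * RATE_LIMIT_BUFFER_DURATION;
    }

    public static void main(String[] args) {
        System.out.println("record expiry: "
                + recordExpiryMillis() / MINUTE_IN_MILLIS + " minutes"); // 30 minutes
    }
}
```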
Filtering rules
When a tombstone file is generated, handleTombstone is invoked; inside addTombstoneToDropBox the entry is filtered through DropboxRateLimiter's shouldRateLimit method:
/** Determines whether dropbox entries of a specific tag and process should be rate limited. */
public RateLimitResult shouldRateLimit(String eventType, String processName) {
// Rate-limit how often we're willing to do the heavy lifting to collect and record logs.
final long now = mClock.uptimeMillis();
synchronized (mErrorClusterRecords) {
// Remove expired records if enough time has passed since the last cleanup.
maybeRemoveExpiredRecords(now);
ErrorRecord errRecord = mErrorClusterRecords.get(errorKey(eventType, processName));
if (errRecord == null) {
errRecord = new ErrorRecord(now, 1);
mErrorClusterRecords.put(errorKey(eventType, processName), errRecord);
return new RateLimitResult(false, 0);
}
final long timeSinceFirstError = now - errRecord.getStartTime();
if (timeSinceFirstError > errRecord.getBufferDuration()) {
final int errCount = recentlyDroppedCount(errRecord);
errRecord.setStartTime(now);
errRecord.setCount(1);
// If this error happened exactly the next "rate limiting cycle" after the last
// error and the previous cycle was rate limiting then increment the successive
// rate limiting cycle counter. If a full "cycle" has passed since the last error
// then this is no longer a continuous occurrence and will be rate limited normally.
if (errCount > 0 && timeSinceFirstError < 2 * errRecord.getBufferDuration()) {
errRecord.incrementSuccessiveRateLimitCycles();
} else {
errRecord.setSuccessiveRateLimitCycles(0);
}
return new RateLimitResult(false, errCount);
}
errRecord.incrementCount();
if (errRecord.getCount() > errRecord.getAllowedEntries()) {
return new RateLimitResult(true, recentlyDroppedCount(errRecord));
}
}
return new RateLimitResult(false, 0);
}
First we need to understand mErrorClusterRecords and the ErrorRecord class. mErrorClusterRecords is an ArrayMap whose key is "eventType + processName" and whose value is an ErrorRecord:
@GuardedBy("mErrorClusterRecords")
private final ArrayMap<String, ErrorRecord> mErrorClusterRecords = new ArrayMap<>();
private class ErrorRecord {
long mStartTime;
int mCount;
int mSuccessiveRateLimitCycles;
ErrorRecord(long startTime, int count) {
mStartTime = startTime;
mCount = count;
mSuccessiveRateLimitCycles = 0;
}
The table below summarizes the shouldRateLimit filtering rules. On entry, maybeRemoveExpiredRecords is called first, removing any ErrorRecord older than mRateLimitBufferExpiryFactor * mRateLimitBufferDuration (30 minutes by default):
| Branch | Filtered? |
|---|---|
| `errRecord == null` — no record exists for this "eventType + processName" key | The pair has produced no tombstone in the last 30 minutes (possibly never). A new ErrorRecord is created with startTime = now and count = 1 and put into mErrorClusterRecords; shouldRateLimit returns false (not filtered) |
| `timeSinceFirstError > errRecord.getBufferDuration()` — a record exists, but its window has elapsed | The pair has crashed within the last 30 minutes, but the window started more than one buffer duration ago (10 minutes normally, 20 minutes in strict crash-loop mode). recentlyDroppedCount computes errCount: 0 when the count is below getAllowedEntries() (6), otherwise count − 6. The window is then reset (startTime = now, count = 1). If errCount > 0 and less than two buffer durations have passed (i.e. this error falls in the very next cycle after a rate-limiting one), incrementSuccessiveRateLimitCycles() is called; otherwise the successive-cycle counter is reset to 0. shouldRateLimit returns false (not filtered) |
| Otherwise — the window is still open | The count is incremented; once it exceeds getAllowedEntries() (6), shouldRateLimit returns true (filtered). Below that threshold it returns false |
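To make the rules concrete, here is a deliberately simplified, standalone re-implementation of the normal (non-strict) window logic; the class and field names are mine, not AOSP's. With the 6-per-10-minutes default, ten back-to-back crashes of the same process/eventType keep six entries and drop four:

```java
import java.util.HashMap;
import java.util.Map;

public class MiniRateLimiter {
    static final long BUFFER_DURATION_MS = 10 * 60_000L; // RATE_LIMIT_BUFFER_DURATION
    static final int ALLOWED_ENTRIES = 6;                // RATE_LIMIT_ALLOWED_ENTRIES

    static final class Record {
        long startTime;
        int count;
        Record(long t) { startTime = t; count = 1; }
    }

    final Map<String, Record> records = new HashMap<>();

    /** Returns true if the entry should be dropped (rate limited). */
    boolean shouldRateLimit(String eventType, String process, long now) {
        String key = eventType + process;
        Record r = records.get(key);
        if (r == null) {                       // first occurrence: start a window
            records.put(key, new Record(now));
            return false;
        }
        if (now - r.startTime > BUFFER_DURATION_MS) { // window elapsed: reset it
            r.startTime = now;
            r.count = 1;
            return false;
        }
        return ++r.count > ALLOWED_ENTRIES;    // drop everything past the 6th
    }

    public static void main(String[] args) {
        MiniRateLimiter rl = new MiniRateLimiter();
        int dropped = 0;
        for (int i = 0; i < 10; i++) {
            if (rl.shouldRateLimit("SYSTEM_TOMBSTONE", "com.example.demo", 1000L * i)) dropped++;
        }
        System.out.println("dropped=" + dropped); // 10 crashes in <10 min: 6 kept, 4 dropped
    }
}
```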
The other path
As we know, AMS creates a NativeCrashListener after it starts; it also listens for native crashes and adds them to dropbox. This is the "other path" by which a native crash reaches dropbox.
linker
- When an Android native program (such as a JNI library in an APK, or a command-line tool) starts, its ELF header specifies the path of the dynamic linker: the PT_INTERP segment tells the kernel which interpreter to load. When the user runs the executable, the kernel parses its ELF header and finds that PT_INTERP points to /system/bin/linker64
- The kernel maps the linker itself into memory as the interpreter (mmap) and jumps to the linker's entry point (__linker_init) instead of directly executing the program's main function. During this phase the linker installs the default signal handlers for the native program
- After the linker finishes loading and relocating the dynamic libraries, it finally jumps to the executable's real entry point (the main function)
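To make the PT_INTERP lookup concrete, here is a toy parser over a fabricated Elf64 program-header table (real loaders read the table from the file at e_phoff; only the p_type field and the Elf64_Phdr size are used here, following the ELF-64 specification; everything else is fabricated for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class PtInterpSketch {
    static final int PT_INTERP = 3;  // ELF p_type value for the interpreter entry
    static final int PHDR_SIZE = 56; // sizeof(Elf64_Phdr)

    /** Returns the index of the PT_INTERP entry in a raw program-header table, or -1. */
    static int findInterp(byte[] phdrTable, int phnum) {
        ByteBuffer buf = ByteBuffer.wrap(phdrTable).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < phnum; i++) {
            if (buf.getInt(i * PHDR_SIZE) == PT_INTERP) return i; // p_type is the first field
        }
        return -1;
    }

    public static void main(String[] args) {
        // Fabricated two-entry table: PT_LOAD (1) followed by PT_INTERP (3).
        byte[] table = new byte[2 * PHDR_SIZE];
        ByteBuffer buf = ByteBuffer.wrap(table).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(0, 1);                 // PT_LOAD
        buf.putInt(PHDR_SIZE, PT_INTERP); // the entry pointing at /system/bin/linker64
        System.out.println("PT_INTERP at index " + findInterp(table, 2)); // index 1
    }
}
```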
debuggerd
- In its init path, the linker installs debuggerd_signal_handler as the default signal handler for the executable; when a fatal signal occurs, execution enters this handler
- debuggerd_signal_handler first logs the crash summary via log_signal_summary:
919 919 F libc : Fatal signal 11 (SIGSEGV), code 0 (SI_USER from pid 584, uid 0) in tid 919 (ndroid.settings), pid 919 (ndroid.settings)
- It then execs crash_dump to collect and print the crash information:
#define CRASH_DUMP_NAME "crash_dump64"
#define CRASH_DUMP_PATH "/apex/com.android.runtime/bin/" CRASH_DUMP_NAME
execle(CRASH_DUMP_PATH, CRASH_DUMP_NAME, main_tid, pseudothread_tid, debuggerd_dump_type,
nullptr, nullptr);
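The fatal-signal log line quoted above can be illustrated with a simplified reconstruction of the summary formatting (the real log_signal_summary also prints the si_code and the sending pid/uid, which this sketch omits; the signal-name table is a small subset of Linux x86-64 numbering):

```java
import java.util.Map;

public class SignalSummarySketch {
    // Subset of POSIX signal numbers (Linux x86-64 numbering).
    static final Map<Integer, String> SIGNALS =
            Map.of(4, "SIGILL", 6, "SIGABRT", 8, "SIGFPE", 11, "SIGSEGV");

    /** Illustrative reconstruction of debuggerd's fatal-signal summary line. */
    static String summary(int sig, int tid, String threadName, int pid) {
        return String.format("Fatal signal %d (%s) in tid %d (%s), pid %d",
                sig, SIGNALS.getOrDefault(sig, "?"), tid, threadName, pid);
    }

    public static void main(String[] args) {
        System.out.println(summary(11, 919, "ndroid.settings", 919));
    }
}
```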
crash_dump
- crash_dump establishes a socket connection (tombstoned_crash) to tombstoned via tombstoned_connect
- Once connected, it invokes the tombstone debug routines to print and save the crash stack (via libunwindstack) and register state
- After collecting the crash information, it sends a socket message to AMS, reporting which process hit a native crash
- Finally it notifies tombstoned over the socket that this crash dump has completed
AMS
- In its systemReady function, AMS starts NativeCrashListener, which loops listening for events on the socket /data/system/ndebugsocket
- When a native crash occurs, crash_dump (64 or 32) notifies AMS over this socket after capturing the registers and stacks
- On receiving the socket message, AMS runs consumeNativeCrashData, which eventually calls handleApplicationCrashInner("native_crash") to add the entry to dropbox
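NativeCrashListener blocks on a SEQPACKET Unix socket that crash_dump connects to. As a hedged, runnable analogue of that accept-and-read pattern, here is the same shape over a loopback TCP socket (all names and the one-line payload format are illustrative; the real listener parses a binary header plus the tombstone text):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class CrashListenerSketch {
    /** Accept one connection and return the single-line report it delivers. */
    static String acceptOneReport(ServerSocket server) throws IOException {
        try (Socket s = server.accept();
             BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()))) {
            return in.readLine();
        }
    }

    /** Run one listener/reporter round trip and return what the listener read. */
    static String demo() throws Exception {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            // Stand-in for crash_dump connecting to the listener socket.
            Thread reporter = new Thread(() -> {
                try (Socket s = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort());
                     PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
                    out.println("native_crash pid=919 signal=11");
                } catch (IOException ignored) {
                }
            });
            reporter.start();
            String line = acceptOneReport(server);
            reporter.join();
            return line;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("listener got: " + demo());
    }
}
```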
tombstoned
- Runs as the tombstoned user in the system group and starts by default at boot; its main function sets up the socket listener (tombstoned_crash) with crash_accept_cb as the connection callback:
service tombstoned /system/bin/tombstoned
user tombstoned
group system
socket tombstoned_crash seqpacket 0666 system system
socket tombstoned_intercept seqpacket 0666 system system
socket tombstoned_java_trace seqpacket 0666 system system
writepid /dev/cpuset/system-background/tasks
- When crash_dump connects to tombstoned, the crash_accept_cb callback fires and registers crash_completed_cb as the completion callback:
217 217 I tombstoned: received crash request for pid 919
- When crash_dump finishes dumping, tombstoned receives the completion callback and writes the tombstone file:
217 217 E tombstoned: Tombstone written to: tombstone_28
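tombstoned keeps a fixed pool of tombstone_NN files (32 text tombstones by default in AOSP) and reuses the oldest slot when the pool is full, which is why the log above shows tombstone_28 rather than an ever-growing index. Here is a simplified sketch of that slot choice (the real code stats the files at startup; this version just works over an index-to-mtime map, and the names are mine):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class TombstoneSlotSketch {
    static final int MAX_TOMBSTONES = 32; // AOSP's default text-tombstone count

    /** Pick the slot for the next tombstone: the first free index, else the oldest one. */
    static int nextSlot(Map<Integer, Long> mtimeByIndex) {
        for (int i = 0; i < MAX_TOMBSTONES; i++) {
            if (!mtimeByIndex.containsKey(i)) return i; // free slot
        }
        // All slots used: overwrite the file with the oldest mtime.
        return Collections.min(mtimeByIndex.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        Map<Integer, Long> existing = new HashMap<>();
        existing.put(0, 100L);
        existing.put(1, 50L);
        System.out.println("next slot: " + nextSlot(existing)); // slot 2 is still free
    }
}
```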
The overall flow is as follows:
Summary
Based on the Android 15 source code, there are currently two paths by which a native crash is added to dropbox:
- NativeTombstoneManager watches the tombstone directory for changes, and its handleTombstone function adds the tombstone files to dropbox
- AMS listens for native crashes through NativeCrashListener and ultimately adds them to dropbox
To verify this, we started a demo app on an Android 15 emulator and killed it with kill -11 pid. Two files were generated under the tombstone directory:
emu64x:/data/tombstones # ls -al
total 772
drwxrwxr-x 2 system system 4096 2025-05-28 14:52 .
drwxrwx--x 51 system system 4096 2025-05-28 14:51 ..
-rw-rw-r-- 1 tombstoned system 462008 2025-05-28 14:52 tombstone_00
-rw-rw-r-- 1 tombstoned system 299220 2025-05-28 14:52 tombstone_00.pb
NativeTombstoneManager adds these two files to dropbox via handleTombstone:
-rw------- 1 system system 14273 2025-05-28 14:52 SYSTEM_TOMBSTONE@1748415174157.txt.gz
-rw------- 1 system system 56406 2025-05-28 14:52 SYSTEM_TOMBSTONE_PROTO_WITH_HEADERS@1748415174142.dat.gz
AMS, listening for native crashes through NativeCrashListener, adds the following file to dropbox:
-rw------- 1 system system 16306 2025-05-28 14:52 data_app_native_crash@1748415174124.txt