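1. Background
- Background: This article introduces the System Server FD check mechanism added in Android 12. The analysis is based on the Android 14 source code.
- Purpose: The mechanism periodically polls the fd usage of SystemServer. Once a warning threshold is reached, it triggers the corresponding action, such as capturing an hprof heap dump or deliberately aborting the process.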
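2. Principle
- Principle: The check opens /dev/null and uses the returned fd number to decide whether a threshold has been crossed and an action should be triggered. This works because a process's fd numbers start at 0 and the kernel always hands out the lowest-numbered free descriptor. When the descriptor table has no gaps, the fd returned by a fresh open is the current maximum and equals the number of fds already open. If lower-numbered fds have been closed, the probe lands in a gap and the value is only a lower bound on the open-fd count.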
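The lowest-free-descriptor rule can be illustrated with a small toy model. This is a sketch in plain Java, not Android code: the class FdTableModel and its methods are hypothetical names made up for illustration, and a BitSet stands in for the kernel's descriptor table.

```java
import java.util.BitSet;

/** Toy model of a per-process descriptor table (hypothetical, not Android code). */
class FdTableModel {
    private final BitSet inUse = new BitSet();

    /** Mimics open(): hands out the lowest-numbered free descriptor. */
    int open() {
        int fd = inUse.nextClearBit(0);
        inUse.set(fd);
        return fd;
    }

    /** Mimics close(): frees the descriptor so that it can be reused. */
    void close(int fd) {
        inUse.clear(fd);
    }

    /** Number of descriptors currently open. */
    int openCount() {
        return inUse.cardinality();
    }

    /** Mimics the probe: open a descriptor, note its number, close it again. */
    int probe() {
        int fd = open();
        close(fd);
        return fd;
    }

    public static void main(String[] args) {
        FdTableModel table = new FdTableModel();
        for (int i = 0; i < 10; i++) {
            table.open();                      // occupies descriptors 0..9
        }
        System.out.println(table.probe());     // 10: dense table, probe equals open count
        table.close(4);                        // leave a gap at descriptor 4
        System.out.println(table.probe());     // 4: the probe lands in the gap
        System.out.println(table.openCount()); // 9: the probe value is only a lower bound
    }
}
```

In a long-running process that leaks fds, gaps tend to fill up quickly, which is presumably why the probe value is a good enough signal for a threshold check.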
3. Code walkthrough
- In SystemServer's run method, the check is enabled only on debuggable builds (Build.IS_DEBUGGABLE):
// Debug builds - spawn a thread to monitor for fd leaks.
if (Build.IS_DEBUGGABLE) {
spawnFdLeakCheckThread();
}
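- spawnFdLeakCheckThread starts a thread that polls whether SystemServer's fd number has reached a threshold. Each value is read from a system property. The defaults are an enable threshold of 1600, an abort threshold of 3000, and a check interval of 120 seconds: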
final int enableThreshold = SystemProperties.getInt(SYSPROP_FDTRACK_ENABLE_THRESHOLD, 1600);
final int abortThreshold = SystemProperties.getInt(SYSPROP_FDTRACK_ABORT_THRESHOLD, 3000);
final int checkInterval = SystemProperties.getInt(SYSPROP_FDTRACK_INTERVAL, 120);
- The current fd level is obtained as described in the Principle section, from the fd returned by opening /dev/null:
private static int getMaxFd() {
FileDescriptor fd = null;
try {
fd = Os.open("/dev/null", O_RDONLY | O_CLOEXEC, 0);
return fd.getInt$();
} catch (ErrnoException ex) {
Slog.e("System", "Failed to get maximum fd: " + ex);
} finally {
if (fd != null) {
try {
Os.close(fd);
} catch (ErrnoException ex) {
// If Os.close threw, something went horribly wrong.
throw new RuntimeException(ex);
}
}
}
// Reached only when Os.open failed.
return Integer.MAX_VALUE;
}
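- When maxFd exceeds enableThreshold (default 1600), the thread runs a GC and finalization, then refreshes maxFd with the value measured afterwards.
- The first time maxFd still exceeds enableThreshold after that GC, the thread calls System.loadLibrary("fdtrack"). This happens only once. libfdtrack.so is needed later to dump fd information.
- When maxFd exceeds abortThreshold (default 3000) and fdtrack is already enabled, the thread captures an hprof and deliberately aborts, which produces a tombstone that lists the fd information.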
new Thread(() -> {
boolean enabled = false;
long nextWrite = 0;
while (true) {
int maxFd = getMaxFd();
if (maxFd > enableThreshold) {
// Do a manual GC to clean up fds that are hanging around as garbage.
System.gc();
System.runFinalization();
maxFd = getMaxFd();
}
if (maxFd > enableThreshold && !enabled) {
Slog.i("System", "fdtrack enable threshold reached, enabling");
FrameworkStatsLog.write(FrameworkStatsLog.FDTRACK_EVENT_OCCURRED,
FrameworkStatsLog.FDTRACK_EVENT_OCCURRED__EVENT__ENABLED,
maxFd);
System.loadLibrary("fdtrack");
enabled = true;
} else if (maxFd > abortThreshold) {
Slog.i("System", "fdtrack abort threshold reached, dumping and aborting");
FrameworkStatsLog.write(FrameworkStatsLog.FDTRACK_EVENT_OCCURRED,
FrameworkStatsLog.FDTRACK_EVENT_OCCURRED__EVENT__ABORTING,
maxFd);
dumpHprof();
fdtrackAbort();
} else {
// Limit this to once per hour.
long now = SystemClock.elapsedRealtime();
if (now > nextWrite) {
nextWrite = now + 60 * 60 * 1000;
FrameworkStatsLog.write(FrameworkStatsLog.FDTRACK_EVENT_OCCURRED,
enabled ? FrameworkStatsLog.FDTRACK_EVENT_OCCURRED__EVENT__ENABLED
: FrameworkStatsLog.FDTRACK_EVENT_OCCURRED__EVENT__DISABLED,
maxFd);
}
}
try {
Thread.sleep(checkInterval * 1000);
} catch (InterruptedException ex) {
continue;
}
}
}).start();
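The branch order in the loop above has one consequence that is easy to miss: enabling and aborting are mutually exclusive within a single pass. The following is a minimal sketch in plain Java, not Android code. FdCheckModel, Action and decide are hypothetical names, and the GC step, logging and sleeping are left out.

```java
/** Toy model of the branch order in the polling loop (hypothetical, not Android code). */
class FdCheckModel {
    enum Action { ENABLE_FDTRACK, DUMP_AND_ABORT, REPORT_ONLY }

    /** Mirrors the if / else-if / else order of the loop quoted above. */
    static Action decide(int maxFd, int enableThreshold, int abortThreshold, boolean enabled) {
        if (maxFd > enableThreshold && !enabled) {
            return Action.ENABLE_FDTRACK;
        } else if (maxFd > abortThreshold) {
            return Action.DUMP_AND_ABORT;
        } else {
            return Action.REPORT_ONLY;
        }
    }

    public static void main(String[] args) {
        boolean enabled = false;
        int[] samples = {1200, 1700, 2500, 3100};   // made-up probe values, one per poll
        for (int maxFd : samples) {
            Action action = decide(maxFd, 1600, 3000, enabled);
            if (action == Action.ENABLE_FDTRACK) {
                enabled = true;                     // fdtrack is loaded only once
            }
            System.out.println(maxFd + " -> " + action);
        }
        // Prints:
        // 1200 -> REPORT_ONLY
        // 1700 -> ENABLE_FDTRACK
        // 2500 -> REPORT_ONLY
        // 3100 -> DUMP_AND_ABORT
    }
}
```

Note that decide(3100, 1600, 3000, false) returns ENABLE_FDTRACK, not DUMP_AND_ABORT. If the fd number jumps past both thresholds between two polls, the first pass only loads fdtrack and the abort happens one check interval later at the earliest.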
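- dumpHprof is implemented as follows. From the code logic, only 2 dumps are kept (MAX_HEAP_DUMPS, counting the one about to be written): when the limit is reached, the newest MAX_HEAP_DUMPS - 1 files stay and the older ones are deleted before the new dump is written. Heap dump files are large, so this saves disk space: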
private static final File HEAP_DUMP_PATH = new File("/data/system/heapdump/");
private static void dumpHprof() {
// hprof dumps are rather large, so ensure we don't fill the disk by generating
// hundreds of these that will live forever.
TreeSet<File> existingTombstones = new TreeSet<>();
for (File file : HEAP_DUMP_PATH.listFiles()) {
if (!file.isFile()) {
continue;
}
if (!file.getName().startsWith("fdtrack-")) {
continue;
}
existingTombstones.add(file);
}
if (existingTombstones.size() >= MAX_HEAP_DUMPS) {
for (int i = 0; i < MAX_HEAP_DUMPS - 1; ++i) {
// Leave the newest `MAX_HEAP_DUMPS - 1` tombstones in place.
existingTombstones.pollLast();
}
for (File file : existingTombstones) {
if (!file.delete()) {
Slog.w("System", "Failed to clean up hprof " + file);
}
}
}
try {
String date = new SimpleDateFormat("yyyy-MM-dd-HH-mm-ss").format(new Date());
String filename = "/data/system/heapdump/fdtrack-" + date + ".hprof";
Debug.dumpHprofData(filename);
} catch (IOException ex) {
Slog.e("System", "Failed to dump fdtrack hprof", ex);
}
}
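The cleanup rule in dumpHprof can be checked in isolation. The sketch below is plain Java and not Android code: HprofRetentionModel and toDelete are hypothetical names, file names stand in for File objects, and the timestamps are made up. It relies on the yyyy-MM-dd-HH-mm-ss name format, which sorts lexicographically in chronological order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Toy model of the hprof cleanup rule (hypothetical, not Android code). */
class HprofRetentionModel {
    /** Returns the file names that would be deleted before a new dump is written. */
    static List<String> toDelete(List<String> existingNames, int maxHeapDumps) {
        // Sorted ascending, so the newest timestamp is last.
        TreeSet<String> sorted = new TreeSet<>();
        for (String name : existingNames) {
            if (name.startsWith("fdtrack-")) {
                sorted.add(name);
            }
        }
        if (sorted.size() < maxHeapDumps) {
            return new ArrayList<>();       // still room for the new dump
        }
        for (int i = 0; i < maxHeapDumps - 1; ++i) {
            sorted.pollLast();              // spare the newest maxHeapDumps - 1 files
        }
        return new ArrayList<>(sorted);     // everything left over is deleted
    }

    public static void main(String[] args) {
        List<String> existing = Arrays.asList(
                "fdtrack-2024-03-01-00-00-00.hprof",
                "fdtrack-2024-01-01-00-00-00.hprof",
                "fdtrack-2024-02-01-00-00-00.hprof");
        // With a limit of 2, the January and February dumps are deleted,
        // leaving the March dump plus the one about to be written.
        System.out.println(toDelete(existing, 2));
    }
}
```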
- fdtrackAbort is implemented as follows: it queues the BIONIC_SIGNAL_FDTRACK signal to its own process with val.sival_int = 1 attached:
static void android_server_SystemServer_fdtrackAbort(JNIEnv*, jobject) {
sigval val;
val.sival_int = 1;
sigqueue(getpid(), BIONIC_SIGNAL_FDTRACK, val);
}
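- The handler for BIONIC_SIGNAL_FDTRACK is registered when the fdtrack library is loaded with dlopen. Its constructor function installs the handler and specifies the action to take: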
__attribute__((constructor)) static void ctor() {
for (auto& entry : stack_traces) {
entry.backtrace.reserve(kStackDepth);
}
struct sigaction sa = {};
sa.sa_sigaction = [](int, siginfo_t* siginfo, void*) {
if (siginfo->si_code == SI_QUEUE && siginfo->si_int == 1) {
fdtrack_dump_fatal();
} else {
fdtrack_dump();
}
};
sa.sa_flags = SA_SIGINFO | SA_ONSTACK;
sigaction(BIONIC_SIGNAL_FDTRACK, &sa, nullptr);
if (Maps().Parse()) {
ProcessMemory() = unwindstack::Memory::CreateProcessMemoryThreadCached(getpid());
android_fdtrack_hook_t expected = nullptr;
installed = android_fdtrack_compare_exchange_hook(&expected, &fd_hook);
}
android_fdtrack_set_globally_enabled(true);
}
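- When BIONIC_SIGNAL_FDTRACK is received, the handler performs the corresponding action. Because the signal here carries si_int = 1, fdtrack_dump_fatal is executed: it dumps the fds and deliberately triggers an abort, so that the information ends up in the tombstone. The details are not expanded here.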
if (siginfo->si_code == SI_QUEUE && siginfo->si_int == 1) {
fdtrack_dump_fatal();
} else {
fdtrack_dump();
}
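4. Thoughts
In past experience, when system server leaks fds, the information printed by fdtrack is often not enough. For example, when the system leaks windows, each window holds a pair of sockets, and these are all fds. In that case fdtrack prints a large number of sockets, but the root cause still cannot be located. It is therefore advisable to add a window dump when an fd leak occurs. Combined with fdtrack, this makes the investigation far more effective.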