This analysis is based on Android R (11).
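SIGSEGV is signal 11 and is raised on an invalid memory access. Once raised, the signal is delivered to user space for handling: a pure native process handles it in debuggerd_signal_handler, while an app process (zygote and its children) handles it in SignalChain::Handler.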
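Compared with a pure native process, an app process adds a layer of wrapping and dispatch, mainly to detect NPE (NullPointerException) and SOE (StackOverflowError) in the Java world. Java code runs in one of two modes: interpreted, or compiled to machine code. Interpreted execution does not raise SIGSEGV for these cases, because the interpreter can check each instruction's operands before executing it and throw the NPE or SOE when the check fails. Compiled code runs as native instructions: each ldr/str executes without a prior check, so it can raise SIGSEGV.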
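To make the contrast concrete, here is a minimal sketch of an interpreter-style field read that checks its operand before touching memory. ObjectSketch, ExecStatus and InterpretFieldRead are hypothetical names chosen for illustration; they are not ART's interpreter API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical object layout, used only for this illustration.
struct ObjectSketch {
  int32_t field;
};

enum class ExecStatus { kOk, kNullPointerException };

// Interpreter-style read: the receiver is checked before the load, so a null
// receiver turns into a Java exception and the load is never executed.
ExecStatus InterpretFieldRead(const ObjectSketch* receiver, int32_t* out) {
  assert(out != nullptr);
  if (receiver == nullptr) {
    return ExecStatus::kNullPointerException;  // raised by the interpreter itself
  }
  *out = receiver->field;
  return ExecStatus::kOk;
}
```

Compiled code that relies on implicit checks leaves out the explicit comparison and lets the hardware fault report the null receiver, which is why a SIGSEGV handler is needed at all.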
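The following sections walk through the source code, first the registration of the handlers and then the dispatch.
Registering the signal handlers
Every Android app process is forked from zygote, so it inherits zygote's disposition for each signal.
zygote itself comes from a child process forked by init that executes the app_process binary. We usually think of main() as a program's entry point, but main is only the logical entry. When the exec system call runs, control actually goes first to the _start entry of /system/bin/linker64; only after the linker has bootstrapped does execution reach main.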
ENTRY(_start)
// Force unwinds to end in this function.
.cfi_undefined x30
mov x0, sp
bl __linker_init
/* linker init returns the _entry address in the main image */
br x0
END(_start)
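During __linker_init, linker_debuggerd_init() is eventually called, which registers debuggerd_signal_handler as the handler for SIGSEGV. The process's first SIGSEGV registration therefore happens while the linker bootstraps itself, before app_process's main function runs.
Once zygote is running, it has to start the ART virtual machine. Runtime::Init initializes the global fault_manager and registers the NPE and SOE handlers.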
// Dex2Oat's Runtime does not need the signal chain or the fault handler.
if (implicit_null_checks_ || implicit_so_checks_ || implicit_suspend_checks_) {
fault_manager.Init();
// These need to be in a specific order. The null point check handler must be
// after the suspend check and stack overflow check handlers.
//
// Note: the instances attach themselves to the fault manager and are handled by it. The
// manager will delete the instance on Shutdown().
if (implicit_suspend_checks_) {
new SuspensionHandler(&fault_manager);
}
if (implicit_so_checks_) {
new StackOverflowHandler(&fault_manager);
}
if (implicit_null_checks_) {
new NullPointerHandler(&fault_manager);
}
if (kEnableJavaStackTraceHandler) {
new JavaStackTraceHandler(&fault_manager);
}
}
While initializing, the global fault_manager registers SIGSEGV a second time: the previously registered debuggerd_signal_handler is saved in the action_ field, and SignalChain::Handler is installed as the new handler.
void Register(int signo) {
struct sigaction64 handler_action = {};
sigfillset64(&handler_action.sa_mask);
...
handler_action.sa_sigaction = SignalChain::Handler;
handler_action.sa_flags = SA_RESTART | SA_SIGINFO | SA_ONSTACK |
SA_UNSUPPORTED | SA_EXPOSE_TAGBITS;
linked_sigaction64(signo, &handler_action, &action_);
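One detail in the code above deserves attention: the registration goes through linked_sigaction64 rather than sigaction64. The default sigaction64 is implemented by libc, but libsigchain implements a sigaction64 of its own. It keeps libc's version under the name linked_sigaction64 and shadows the libc symbol, so any later call to sigaction64 from an app's shared library lands in libsigchain.
The purpose is to keep registrations made by app libraries from interfering with NPE and SOE detection.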
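The effect of the shadowed sigaction64 can be modeled with a small sketch. It assumes, as a simplification, that a signal claimed by the runtime only has the app's request recorded, while an unclaimed signal would be forwarded to libc. ChainEntry, ClaimSignal and InterposedSetHandler are hypothetical names, not libsigchain's real interface.

```cpp
#include <cassert>
#include <csignal>

constexpr int kMaxSignalsSketch = 65;  // assumption: covers Linux signals 1..64

using Handler = void (*)(int);

// Hypothetical, simplified model of one entry in the signal chain.
struct ChainEntry {
  bool claimed = false;            // true once the runtime owns the real handler
  Handler user_handler = nullptr;  // what an app library asked for
};

static ChainEntry g_chains[kMaxSignalsSketch];

// Called by the runtime when it installs its own handler for signo.
void ClaimSignal(int signo) {
  assert(signo > 0 && signo < kMaxSignalsSketch);
  g_chains[signo].claimed = true;
}

// Stand-in for the interposed sigaction64. Returns true if the request was
// only recorded, false if it would have been forwarded to libc.
bool InterposedSetHandler(int signo, Handler new_handler) {
  assert(signo > 0 && signo < kMaxSignalsSketch);
  ChainEntry& entry = g_chains[signo];
  if (!entry.claimed) {
    return false;  // a real implementation would call the linked libc function
  }
  entry.user_handler = new_handler;  // the kernel-level handler is left alone
  return true;
}
```

In this model an app library that registers its own SIGSEGV handler after ClaimSignal(SIGSEGV) never displaces the runtime's handler; its function is simply remembered so it can be called later in the chain.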
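The following excerpt from libsigchain's Android.bp shows how this is done: the -z,global linker flag makes libsigchain's symbols take precedence, so libc's sigaction symbol is shadowed.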
// Make libsigchain symbols global, so that an app library which
// is loaded in a classloader linker namespace looks for
// libsigchain symbols before libc.
// -z,global marks the binary with the DF_1_GLOBAL flag which puts the symbols
// in the global group. It does not affect their visibilities like the version
// script does.
ldflags: ["-Wl,-z,global"],
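In current Android versions, implicit_null_checks_ and implicit_so_checks_ are enabled by default, while implicit_suspend_checks_ and kEnableJavaStackTraceHandler are disabled by default.
StackOverflowHandler and NullPointerHandler both derive from FaultHandler. On construction, each registers its Action method in the generated_code_handlers_ array; StackOverflowHandler, for example, registers StackOverflowHandler::Action. The array is named "generated code" because an APK initially contains only dex files, and machine code exists only after dex2oat has run on the device, so that machine code is referred to here as "generated code".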
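The self-registration pattern described above can be sketched as follows. FaultManagerSketch and FaultHandlerSketch are hypothetical stand-ins that only keep the registration step; the real classes carry much more state.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

class FaultHandlerSketch;

// Hypothetical stand-in for the fault manager: it only stores the handlers.
class FaultManagerSketch {
 public:
  void AddGeneratedCodeHandler(FaultHandlerSketch* handler) {
    generated_code_handlers_.push_back(handler);
  }
  const std::vector<FaultHandlerSketch*>& handlers() const {
    return generated_code_handlers_;
  }

 private:
  std::vector<FaultHandlerSketch*> generated_code_handlers_;
};

// Constructing a handler is what registers it; no separate call is needed.
class FaultHandlerSketch {
 public:
  FaultHandlerSketch(FaultManagerSketch* manager, std::string name)
      : name_(std::move(name)) {
    assert(manager != nullptr);
    manager->AddGeneratedCodeHandler(this);
  }
  const std::string& name() const { return name_; }

 private:
  std::string name_;
};
```

Because registration happens in the constructor, the order of construction is also the order in which the handlers are later tried.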
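Signal dispatch
SIGSEGV in an app process is dispatched as follows:
- SignalChain::Handler receives the SIGSEGV, together with the fault pc and the fault address.
- The handlers in generated_code_handlers_ are tried first. They were registered when the VM started: one throws NPE, the other throws SOE. Each handler applies its own rule to decide whether the fault is of its kind. If so, it throws the Java exception and dispatch ends; if not, the next handler is tried.
- The handlers in other_handlers_ are tried next. This array is empty by default.
- Finally debuggerd_signal_handler is called. It produces a tombstone file containing the call stacks of all threads, the memory map, and related information.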
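The dispatch order in the list above can be modeled in a few lines. This is a sketch under simplifying assumptions: FaultSketch reduces a fault to three booleans, and the returned strings merely name the stage that ended the dispatch. None of these names exist in ART.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical description of a fault; real handlers get siginfo_t and a ucontext.
struct FaultSketch {
  bool in_generated_code;
  bool is_null_deref;
  bool is_stack_probe;
};

// A handler returns true when it has handled the fault, which stops the dispatch.
using HandlerSketch = std::function<bool(const FaultSketch&)>;

std::string DispatchSketch(const FaultSketch& fault,
                           const std::vector<HandlerSketch>& generated_code_handlers,
                           const std::vector<HandlerSketch>& other_handlers) {
  if (fault.in_generated_code) {
    for (const HandlerSketch& handler : generated_code_handlers) {
      assert(handler != nullptr);
      if (handler(fault)) {
        return "java exception";
      }
    }
  }
  for (const HandlerSketch& handler : other_handlers) {
    assert(handler != nullptr);
    if (handler(fault)) {
      return "other handler";
    }
  }
  return "tombstone";  // stands for the fallback to debuggerd_signal_handler
}
```

A null dereference in compiled Java code ends at the first stage, while the same fault address raised from native code falls through to the tombstone path.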
The meaning of SignalChain was outlined at the beginning of this article, but the comment in the source code states it more precisely.
// libsigchain provides an interception layer for signal handlers, to allow ART and others to give
// their signal handlers the first stab at handling signals before passing them on to user code.
//
// It implements wrapper functions for signal, sigaction, and sigprocmask, and a handler that
// forwards signals appropriately.
SignalChain::Handler first walks the handlers in special_handlers_ and then calls the function stored in the action_ field.
void SignalChain::Handler(int signo, siginfo_t* siginfo, void* ucontext_raw) {
// Try the special handlers first.
// If one of them crashes, we'll reenter this handler and pass that crash onto the user handler.
if (!GetHandlingSignal()) {
for (const auto& handler : chains[signo].special_handlers_) {
if (handler.sc_sigaction == nullptr) {
break;
}
sigset_t previous_mask;
linked_sigprocmask(SIG_SETMASK, &handler.sc_mask, &previous_mask);
ScopedHandlingSignal restorer;
SetHandlingSignal(true);
if (handler.sc_sigaction(signo, siginfo, ucontext_raw)) {
return;
}
linked_sigprocmask(SIG_SETMASK, &previous_mask, nullptr);
}
}
// Forward to the user's signal handler.
chains[signo].action_.sa_sigaction(signo, siginfo, ucontext_raw);
}
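Two points in the special_handlers_ loop deserve attention:
- Before a handler is called, the signal mask is switched to handler.sc_mask, and it is restored once the handler has finished. For art_fault_handler, handler.sc_mask is set up as shown below. The point of this setup is to cope with another signal being raised inside the handler itself.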
sigfillset(&mask);
sigdelset(&mask, SIGABRT);
sigdelset(&mask, SIGBUS);
sigdelset(&mask, SIGFPE);
sigdelset(&mask, SIGILL);
sigdelset(&mask, SIGSEGV);
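- Before a handler is called, SetHandlingSignal(true) is executed. Combined with the mask change in the first point, this lets a second entry into SignalChain::Handler skip ART's handling, because a second entry usually means that something went wrong inside ART's own handler.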
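Both points can be tried out in isolation. The sketch below builds the same mask with the POSIX sigset functions and models the re-entry guard with a thread-local flag. The nested fault is simulated by a direct recursive call, and all function and variable names are hypothetical.

```cpp
#include <cassert>
#include <initializer_list>
#include <signal.h>

// Blocks everything except the signals that a fault inside the handler would raise.
sigset_t BuildSpecialHandlerMask() {
  sigset_t mask;
  sigfillset(&mask);
  for (int signo : {SIGABRT, SIGBUS, SIGFPE, SIGILL, SIGSEGV}) {
    sigdelset(&mask, signo);
  }
  return mask;
}

// Hypothetical per-thread flag, playing the role of GetHandlingSignal().
static thread_local bool g_handling_signal = false;
static int g_special_calls = 0;
static int g_forwarded_calls = 0;

void ChainHandlerSketch(bool special_handler_faults);

// A special handler that may itself fault, which re-enters the chain handler.
bool SpecialHandlerSketch(bool faults) {
  ++g_special_calls;
  if (faults) {
    ChainHandlerSketch(false);  // models a nested SIGSEGV inside this handler
  }
  return false;  // fault not handled
}

void ChainHandlerSketch(bool special_handler_faults) {
  if (!g_handling_signal) {
    g_handling_signal = true;
    const bool handled = SpecialHandlerSketch(special_handler_faults);
    g_handling_signal = false;
    if (handled) {
      return;
    }
  }
  ++g_forwarded_calls;  // models forwarding to the saved action_
}
```

Calling ChainHandlerSketch(true) runs the special handler only once even though the chain handler is entered twice. In a real crash the nested forward normally ends the process, so the outer call would not continue as it does in this model.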
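art_fault_handler calls FaultManager::HandleFault. That function first checks whether the fault pc belongs to machine code compiled from Java. If it does, the NPE and SOE checks run; otherwise generated_code_handlers_ is skipped and other_handlers_ is walked directly.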
bool FaultManager::HandleFault(int sig, siginfo_t* info, void* context) {
if (IsInGeneratedCode(info, context, true)) {
VLOG(signals) << "in generated code, looking for handler";
for (const auto& handler : generated_code_handlers_) {
VLOG(signals) << "invoking Action on handler " << handler;
if (handler->Action(sig, info, context)) {
// We have handled a signal so it's time to return from the
// signal handler to the appropriate place.
return true;
}
}
}
// We hit a signal we didn't handle. This might be something for which
// we can give more information about so call all registered handlers to
// see if it is.
if (HandleFaultByOtherHandlers(sig, info, context)) {
return true;
}
return false;
}
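IsInGeneratedCode works as follows. If the current thread is in the Runnable state and holds the mutator lock (a reader-writer lock, held shared here, which means the thread may access the Java heap), it is very likely running compiled Java code. The function then relies on the layout of the Java stack (the ArtMethod pointer is stored at the top of the frame) to find the ArtMethod and checks whether the fault pc falls inside that method's code range. If it does, the fault is confirmed to come from generated code.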
// This function is called within the signal handler. It checks that
// the mutator_lock is held (shared). No annotalysis is done.
bool FaultManager::IsInGeneratedCode(siginfo_t* siginfo, void* context, bool check_dex_pc) {
// We can only be running Java code in the current thread if it
// is in Runnable state.
Thread* thread = Thread::Current();
ThreadState state = thread->GetState();
if (state != kRunnable) {
VLOG(signals) << "not runnable";
return false;
}
// Current thread is runnable.
// Make sure it has the mutator lock.
if (!Locks::mutator_lock_->IsSharedHeld(thread)) {
VLOG(signals) << "no lock";
return false;
}
ArtMethod* method_obj = nullptr;
uintptr_t return_pc = 0;
uintptr_t sp = 0;
bool is_stack_overflow = false;
// Get the architecture specific method address and return address. These
// are in architecture specific files in arch/<arch>/fault_handler_<arch>.
GetMethodAndReturnPcAndSp(siginfo, context, &method_obj, &return_pc, &sp, &is_stack_overflow);
const OatQuickMethodHeader* method_header = method_obj->GetOatQuickMethodHeader(return_pc);  // returns nullptr if the pc is outside the ArtMethod's code range
if (method_header == nullptr) {
VLOG(signals) << "no compiled code";
return false;
}
uint32_t dexpc = method_header->ToDexPc(reinterpret_cast<ArtMethod**>(sp), return_pc, false);
return !check_dex_pc || dexpc != dex::kDexNoIndex;
}
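The three conditions checked above (thread state, lock, code range) can be condensed into a small sketch. ThreadSnapshot and CodeRangeSketch are hypothetical types, and treating the code range as half-open is an assumption of this sketch rather than a statement about OatQuickMethodHeader.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified method header: only the compiled code range.
struct CodeRangeSketch {
  uintptr_t code_start;
  uint32_t code_size;  // in bytes

  // Assumption: the range is treated as half-open, [start, start + size).
  bool Contains(uintptr_t pc) const {
    return pc >= code_start && pc - code_start < code_size;
  }
};

// Hypothetical snapshot of the two thread properties that are checked first.
struct ThreadSnapshot {
  bool runnable;
  bool holds_mutator_lock_shared;
};

bool LooksLikeGeneratedCode(const ThreadSnapshot& thread,
                            const CodeRangeSketch& method,
                            uintptr_t pc) {
  if (!thread.runnable) {
    return false;  // a thread that is not Runnable cannot be executing Java code
  }
  if (!thread.holds_mutator_lock_shared) {
    return false;  // without the lock it may not touch the Java heap
  }
  return method.Contains(pc);
}
```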
The detection rules for NPE and SOE are described next.
Detection rule for NullPointerException
NPE detection goes through NullPointerHandler::Action.
bool NullPointerHandler::Action(int sig ATTRIBUTE_UNUSED, siginfo_t* info, void* context) {
if (!IsValidImplicitCheck(info)) {
return false;
}
// The code that looks for the catch location needs to know the value of the
// PC at the point of call. For Null checks we insert a GC map that is immediately after
// the load/store instruction that might cause the fault.
struct ucontext *uc = reinterpret_cast<struct ucontext*>(context);
struct sigcontext *sc = reinterpret_cast<struct sigcontext*>(&uc->uc_mcontext);
// Push the gc map location to the stack and pass the fault address in LR.
sc->sp -= sizeof(uintptr_t);
*reinterpret_cast<uintptr_t*>(sc->sp) = sc->pc + 4;
sc->regs[30] = reinterpret_cast<uintptr_t>(info->si_addr);
sc->pc = reinterpret_cast<uintptr_t>(art_quick_throw_null_pointer_exception_from_signal);
VLOG(signals) << "Generating null pointer exception";
return true;
}
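The check is delegated to IsValidImplicitCheck, whose logic is simple: is the fault address smaller than one page? Why smaller than one page rather than exactly 0? Because what gets accessed is usually a field or the vtable of an object rather than the object itself. Fields and the vtable sit at an offset from the object's start address, so when that start address is 0, the address actually accessed is a small offset value.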
static bool IsValidImplicitCheck(siginfo_t* siginfo) {
// Our implicit NPE checks always limit the range to a page.
// Note that the runtime will do more exhaustive checks (that we cannot
// reasonably do in signal processing code) based on the dex instruction
// faulting.
return CanDoImplicitNullCheckOn(reinterpret_cast<uintptr_t>(siginfo->si_addr));
}
// Returns whether the given memory offset can be used for generating
// an implicit null check.
static inline bool CanDoImplicitNullCheckOn(uintptr_t offset) {
return offset < kPageSize;
}
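A short worked example shows why the limit is one page. The sketch assumes 4 KiB pages and two hypothetical object layouts; with a null base address, the faulting address equals the field offset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr uintptr_t kPageSizeSketch = 4096;  // assumption: 4 KiB pages

// Hypothetical layouts: one with all fields near the start, one with a far field.
struct SmallObjectSketch {
  void* klass;
  int32_t lock;
  int32_t value;
};

struct BigObjectSketch {
  char padding[5000];
  int32_t far_field;
};

// Address touched by a load of the field when the object reference is null.
constexpr uintptr_t FaultAddressForNullBase(size_t field_offset) {
  return static_cast<uintptr_t>(0) + field_offset;
}

// Same comparison as the page-size check shown above.
constexpr bool CanUseImplicitNullCheck(uintptr_t fault_address) {
  return fault_address < kPageSizeSketch;
}

static_assert(
    CanUseImplicitNullCheck(FaultAddressForNullBase(offsetof(SmallObjectSketch, value))),
    "a field close to the object start faults inside the first page");
static_assert(
    !CanUseImplicitNullCheck(FaultAddressForNullBase(offsetof(BigObjectSketch, far_field))),
    "a field beyond one page cannot rely on the implicit check");
```

For an offset like far_field, the fault address would no longer look like a null dereference, so such an access cannot be covered by the implicit check.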
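Once the fault is classified as an NPE, NullPointerHandler::Action rewrites the pc in the saved context. We are inside a signal handler at this point, and by default returning from it re-executes the faulting instruction. If the handler has changed the pc in the saved context, execution instead continues at the new pc after the handler returns.
art_quick_throw_null_pointer_exception_from_signal does two things, which are covered in detail in the section "How the exception is thrown". In short:
- It creates the Java-level NullPointerException object.
- It jumps to the catch block that can handle the exception.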
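The context rewrite can be replayed on plain data. The sketch below uses a hypothetical SavedContextSketch in place of the kernel's struct sigcontext and a caller-provided buffer as the stack; it performs the same four steps as the handler shown earlier.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the saved AArch64 register state.
struct SavedContextSketch {
  uintptr_t sp;
  uintptr_t pc;
  uintptr_t lr;  // x30
};

// Redirects the interrupted code to a throw stub, mirroring the handler's steps.
void RedirectToThrowStub(SavedContextSketch* ctx, uintptr_t fault_addr, uintptr_t stub_entry) {
  assert(ctx != nullptr);
  // 1. Make room for one pointer on the interrupted thread's stack.
  ctx->sp -= sizeof(uintptr_t);
  // 2. Store the address after the faulting instruction (A64 instructions are 4 bytes).
  *reinterpret_cast<uintptr_t*>(ctx->sp) = ctx->pc + 4;
  // 3. Pass the fault address to the stub in the link register.
  ctx->lr = fault_addr;
  // 4. Returning from the signal handler now resumes at the stub.
  ctx->pc = stub_entry;
}
```

The important property is that nothing is called from inside the handler: it only edits the saved registers, and the kernel's return from the signal does the actual jump.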
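Detection rule for StackOverflowError
Before looking at the SOE rule, we need to understand how the stack is laid out in ART.
The last two pages at the growing end of the stack can be neither read nor written; any access there triggers a memory fault. Because the stack grows as functions are entered, the check has to be tied to function calls. On AArch64, every function call executes the instructions below, which load from the address sp-0x2000 (the destination is the zero register, so the loaded value is discarded). If more than two pages of stack remain, sp-0x2000 still falls in the accessible range; if fewer than two pages remain, sp-0x2000 falls in the inaccessible red zone, and touching that region raises SIGSEGV.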
sub x16, sp, #0x2000 (8192)
ldr wzr, [x16]
The actual check therefore compares sp-0x2000 with the fault address. If they are equal, this SIGSEGV was produced by the probe above, which means an SOE has really occurred.
bool StackOverflowHandler::Action(int sig ATTRIBUTE_UNUSED, siginfo_t* info ATTRIBUTE_UNUSED,
void* context) {
struct ucontext *uc = reinterpret_cast<struct ucontext *>(context);
struct sigcontext *sc = reinterpret_cast<struct sigcontext*>(&uc->uc_mcontext);
VLOG(signals) << "stack overflow handler with sp at " << std::hex << &uc;
VLOG(signals) << "sigcontext: " << std::hex << sc;
uintptr_t sp = sc->sp;
VLOG(signals) << "sp: " << std::hex << sp;
uintptr_t fault_addr = sc->fault_address;
VLOG(signals) << "fault_addr: " << std::hex << fault_addr;
VLOG(signals) << "checking for stack overflow, sp: " << std::hex << sp <<
", fault_addr: " << fault_addr;
uintptr_t overflow_addr = sp - GetStackOverflowReservedBytes(InstructionSet::kArm64); // sp - 0x2000
// Check that the fault address is the value expected for a stack overflow.
if (fault_addr != overflow_addr) {
VLOG(signals) << "Not a stack overflow";
return false;
}
VLOG(signals) << "Stack overflow found";
// Now arrange for the signal handler to return to art_quick_throw_stack_overflow.
// The value of LR must be the same as it was when we entered the code that
// caused this fault. This will be inserted into a callee save frame by
// the function to which this handler returns (art_quick_throw_stack_overflow).
sc->pc = reinterpret_cast<uintptr_t>(art_quick_throw_stack_overflow);
// The kernel will now return to the address in sc->pc.
return true;
}
When the SOE check passes, execution continues at art_quick_throw_stack_overflow after the handler returns.
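The probe arithmetic can be checked with concrete numbers. The sketch below uses the 0x2000 distance from the text and a made-up protected region; the function names and addresses are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr uintptr_t kReservedBytesSketch = 0x2000;  // probe distance used in the text

// Address touched by the probe executed on function entry.
constexpr uintptr_t ProbeAddress(uintptr_t sp) {
  return sp - kReservedBytesSketch;
}

// True if the probe from sp lands in the protected region [guard_begin, guard_end).
constexpr bool ProbeFaults(uintptr_t sp, uintptr_t guard_begin, uintptr_t guard_end) {
  return ProbeAddress(sp) >= guard_begin && ProbeAddress(sp) < guard_end;
}

// Handler-side rule: the fault address must be exactly the probe address.
constexpr bool IsImplicitStackOverflow(uintptr_t sp, uintptr_t fault_addr) {
  return fault_addr == ProbeAddress(sp);
}

// Made-up protected region [0x10000, 0x12000) for the checks below.
static_assert(!ProbeFaults(0x15000, 0x10000, 0x12000), "enough stack left: probe is harmless");
static_assert(ProbeFaults(0x13000, 0x10000, 0x12000), "less than two pages left: probe faults");
static_assert(IsImplicitStackOverflow(0x13000, 0x11000), "fault address matches the probe");
static_assert(!IsImplicitStackOverflow(0x13000, 0x8), "an unrelated fault is rejected");
```

The exact equality is what separates a stack probe from any other fault that happens to occur in compiled code.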
How the exception is thrown
After the NPE check passes, the following code runs.
ENTRY art_quick_throw_null_pointer_exception_from_signal
// The fault handler pushes the gc map address, i.e. "return address", to stack
// and passes the fault address in LR. So we need to set up the CFI info accordingly.
.cfi_def_cfa_offset __SIZEOF_POINTER__
.cfi_rel_offset lr, 0
// Save all registers as basis for long jump context.
INCREASE_FRAME (FRAME_SIZE_SAVE_EVERYTHING - __SIZEOF_POINTER__)
SAVE_REG x29, (FRAME_SIZE_SAVE_EVERYTHING - 2 * __SIZEOF_POINTER__) // LR already saved.
SETUP_SAVE_EVERYTHING_FRAME_DECREMENTED_SP_SKIP_X29_LR
mov x0, lr // pass the fault address stored in LR by the fault handler.
mov x1, xSELF // pass Thread::Current.
bl artThrowNullPointerExceptionFromSignal // (arg, Thread*).
brk 0
END art_quick_throw_null_pointer_exception_from_signal
extern "C" NO_RETURN void artThrowNullPointerExceptionFromSignal(uintptr_t addr, Thread* self)
REQUIRES_SHARED(Locks::mutator_lock_) {
ScopedQuickEntrypointChecks sqec(self);
ThrowNullPointerExceptionFromDexPC(/* check_address= */ true, addr);
self->QuickDeliverException();
}
After the SOE check passes, the following code eventually runs.
extern "C" NO_RETURN void artThrowStackOverflowFromCode(Thread* self)
REQUIRES_SHARED(Locks::mutator_lock_) {
ScopedQuickEntrypointChecks sqec(self);
ThrowStackOverflowError(self);
self->QuickDeliverException();
}
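ThrowNullPointerExceptionFromDexPC and ThrowStackOverflowError both construct a Throwable object for the Java world; one builds a NullPointerException, the other a StackOverflowError. The resulting Throwable carries two key pieces of information: the message string and the call stack. The object is stored in the thread->tlsPtr_.exception field so that the rest of the thread can retrieve it.
Next comes QuickDeliverException, whose job is to jump to the matching catch block.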
void Thread::QuickDeliverException() {
// Get exception from thread.
ObjPtr<mirror::Throwable> exception = GetException();
// Don't leave exception visible while we try to find the handler, which may cause class
// resolution.
ClearException();
QuickExceptionHandler exception_handler(this, false);
exception_handler.FindCatch(exception);
exception_handler.DoLongJump();
}
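FindCatch first determines two things:
- The frame that contains the catch block able to handle the exception; the sp of that frame is recorded.
- The start address of that catch block, recorded either as the machine-code address pc or as the bytecode address dex_pc.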
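The search for those two values can be pictured as a walk over the frames from the innermost outward. This is a heavily simplified sketch: FrameSketch and FindCatchSketch are hypothetical, each frame has at most one catch block, and matching is by exact type name, whereas a real lookup also has to consider try ranges and subclasses.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical frame description: at most one catch block per frame.
struct FrameSketch {
  uintptr_t sp;
  std::string catch_type;  // empty if the frame has no catch block
  uintptr_t catch_pc;      // start address of the catch block
};

struct CatchLocation {
  uintptr_t sp = 0;
  uintptr_t pc = 0;
};

// Walks from the innermost frame outward and reports the first matching catch.
bool FindCatchSketch(const std::vector<FrameSketch>& frames_innermost_first,
                     const std::string& exception_type,
                     CatchLocation* out) {
  assert(out != nullptr);
  for (const FrameSketch& frame : frames_innermost_first) {
    if (!frame.catch_type.empty() && frame.catch_type == exception_type) {
      out->sp = frame.sp;        // frame to unwind to
      out->pc = frame.catch_pc;  // where execution resumes
      return true;
    }
  }
  return false;
}
```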
DoLongJump then jumps there.
void QuickExceptionHandler::DoLongJump(bool smash_caller_saves) {
// Place context back on thread so it will be available when we continue.
self_->ReleaseLongJumpContext(context_);
context_->SetSP(reinterpret_cast<uintptr_t>(handler_quick_frame_));
CHECK_NE(handler_quick_frame_pc_, 0u);
context_->SetPC(handler_quick_frame_pc_);
context_->SetArg0(handler_quick_arg0_);
if (smash_caller_saves) {
context_->SmashCallerSaves();
}
if (!is_deoptimization_ &&
handler_method_header_ != nullptr &&
handler_method_header_->IsNterpMethodHeader()) {
context_->SetNterpDexPC(reinterpret_cast<uintptr_t>(
GetHandlerMethod()->DexInstructions().Insns() + handler_dex_pc_));
}
context_->DoLongJump();
UNREACHABLE();
}
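The frame address of that frame is stored in the SP slot, and the machine-code address in the PC slot. If the frame is executed by the interpreter, the machine-code address points to a trampoline, and the real bytecode address dex_pc is stored in the x22 slot, where the interpreter picks it up when it runs.
All slots are then loaded into the actual registers, and a br instruction jumps to the catch block, never to return.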
ENTRY art_quick_do_long_jump
// Load FPRs
ldp d0, d1, [x1, #0]
ldp d2, d3, [x1, #16]
ldp d4, d5, [x1, #32]
ldp d6, d7, [x1, #48]
ldp d8, d9, [x1, #64]
ldp d10, d11, [x1, #80]
ldp d12, d13, [x1, #96]
ldp d14, d15, [x1, #112]
ldp d16, d17, [x1, #128]
ldp d18, d19, [x1, #144]
ldp d20, d21, [x1, #160]
ldp d22, d23, [x1, #176]
ldp d24, d25, [x1, #192]
ldp d26, d27, [x1, #208]
ldp d28, d29, [x1, #224]
ldp d30, d31, [x1, #240]
// Load GPRs. Delay loading x0, x1 because x0 is used as gprs_.
ldp x2, x3, [x0, #16]
ldp x4, x5, [x0, #32]
ldp x6, x7, [x0, #48]
ldp x8, x9, [x0, #64]
ldp x10, x11, [x0, #80]
ldp x12, x13, [x0, #96]
ldp x14, x15, [x0, #112]
// Do not load IP0 (x16) and IP1 (x17), these shall be clobbered below.
// Don't load the platform register (x18) either.
ldr x19, [x0, #152] // xSELF.
ldp x20, x21, [x0, #160] // For Baker RB, wMR (w20) is reloaded below.
ldp x22, x23, [x0, #176]
ldp x24, x25, [x0, #192]
ldp x26, x27, [x0, #208]
ldp x28, x29, [x0, #224]
ldp x30, xIP0, [x0, #240] // LR and SP, load SP to IP0.
// Load PC to IP1, it's at the end (after the space for the unused XZR).
ldr xIP1, [x0, #33*8]
// Load x0, x1.
ldp x0, x1, [x0, #0]
// Set SP. Do not access fprs_ and gprs_ from now, they are below SP.
mov sp, xIP0
REFRESH_MARKING_REGISTER
br xIP1
END art_quick_do_long_jump
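As a portable analogy only, the same one-way transfer can be demonstrated with setjmp and longjmp from the C standard library. This is not how ART implements it (ART restores its own saved context in assembly, as shown above); the sketch just illustrates resuming in an outer frame without returning through the inner ones. All names are hypothetical.

```cpp
#include <cassert>
#include <csetjmp>

static std::jmp_buf g_catch_context;  // plays the role of the saved register set
static int g_delivered_code = 0;      // plays the role of the pending exception

// Transfers control to the saved context and never returns, like the final br.
[[noreturn]] void DeliverSketch(int code) {
  g_delivered_code = code;
  std::longjmp(g_catch_context, 1);
}

// An inner frame that "throws" when given a non-zero code.
void InnerFrame(int code) {
  if (code != 0) {
    DeliverSketch(code);
  }
}

// Returns 0 on normal completion, or the code delivered to the "catch block".
int RunWithCatch(int code) {
  if (setjmp(g_catch_context) != 0) {
    return g_delivered_code;  // execution resumes here after the long jump
  }
  InnerFrame(code);
  return 0;
}
```

RunWithCatch(7) comes back with 7 even though InnerFrame never returns normally, which mirrors how the catch block is reached without unwinding through ordinary returns.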