Discussion | Special handling of the main thread in FP-based stack unwinding

The analysis in this article is based on Android 5 through 11.

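Last week I was discussing GWP-ASan with a friend at ByteDance and learned something I had not known before. I spent some time digging into it and wrote it up here for anyone who needs it.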

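There are several ways to unwind a native stack, and the fastest is undoubtedly the frame pointer (FP) approach: it exploits the regular pattern in which registers are pushed onto the stack to find the return address of every frame quickly. That is why both GWP-ASan and ByteDance's MemCorruption tool use it in 64-bit environments.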

Take GWP-ASan as an example. It uses the following function (added in Android 11) to collect the return address of each frame.

__attribute__((no_sanitize("address", "hwaddress"))) size_t android_unsafe_frame_pointer_chase(
    uintptr_t* buf, size_t num_entries) {
  // Disable MTE checks for the duration of this function, since we can't be sure that following
  // next_frame pointers won't cause us to read from tagged memory. ASAN/HWASAN are disabled here
  // for the same reason.
  ScopedDisableMTE x;

  struct frame_record {
    uintptr_t next_frame, return_addr;
  };

  auto begin = reinterpret_cast<uintptr_t>(__builtin_frame_address(0));
  uintptr_t end = __get_thread()->stack_top;

  stack_t ss;
  if (sigaltstack(nullptr, &ss) == 0 && (ss.ss_flags & SS_ONSTACK)) {
    end = reinterpret_cast<uintptr_t>(ss.ss_sp) + ss.ss_size;
  }

  size_t num_frames = 0;
  while (1) {
    auto* frame = reinterpret_cast<frame_record*>(begin);
    if (num_frames < num_entries) {
      buf[num_frames] = frame->return_addr;
    }
    ++num_frames;
    if (frame->next_frame < begin + sizeof(frame_record) || frame->next_frame >= end ||
        frame->next_frame % sizeof(void*) != 0) {
      break;
    }
    begin = frame->next_frame;
  }

  return num_frames;
}

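Before the walk starts, begin and end have to be assigned. begin is the current value of the FP register and is the starting point of the walk. end is the stack bottom address and is used to decide when the walk must stop. Because the stack grows toward lower addresses, the stack bottom is the highest address of the stack, which is why it is also called stack_top.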
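To make the roles of begin and end concrete, here is a minimal sketch of the same walk with both bounds passed in as parameters, run over a hand-built chain of three frame records instead of a real stack. The names FrameRecord, walk_frames and demo_walk are illustrative and do not come from bionic.

```cpp
#include <cstddef>
#include <cstdint>

// Layout of one frame record: the saved FP followed by the saved return address.
struct FrameRecord {
  uintptr_t next_frame;
  uintptr_t return_addr;
};

// Follows the FP chain starting at `begin`. The walk stops as soon as the next
// record would not lie strictly above the current one, would reach `end`, or is
// misaligned. At most `capacity` return addresses are stored in `out`; the
// return value is the number of frames visited.
size_t walk_frames(uintptr_t begin, uintptr_t end, uintptr_t* out, size_t capacity) {
  size_t num_frames = 0;
  while (true) {
    const auto* frame = reinterpret_cast<const FrameRecord*>(begin);
    if (num_frames < capacity) out[num_frames] = frame->return_addr;
    ++num_frames;
    const uintptr_t next = frame->next_frame;
    if (next < begin + sizeof(FrameRecord) || next >= end || next % sizeof(void*) != 0) break;
    begin = next;
  }
  return num_frames;
}

// Builds a fake stack of three frames in `records` (must hold 3 entries) and walks it.
size_t demo_walk(FrameRecord* records, uintptr_t* out, size_t capacity) {
  records[0] = {reinterpret_cast<uintptr_t>(&records[1]), 0x1111};  // innermost frame
  records[1] = {reinterpret_cast<uintptr_t>(&records[2]), 0x2222};  // its caller
  records[2] = {0, 0x3333};  // outermost frame: next_frame == 0 fails the bounds check
  const uintptr_t begin = reinterpret_cast<uintptr_t>(&records[0]);
  const uintptr_t end = reinterpret_cast<uintptr_t>(records + 3);
  return walk_frames(begin, end, out, capacity);
}
```

If end were too small, valid frames would be cut off; if it were too large, a corrupted next_frame could send the walk into unrelated memory. That is why an accurate stack_top matters.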

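begin is obtained through __builtin_frame_address, a compiler builtin that is turned into specific assembly instructions at compile time. The lowering is shown below. On AArch64 the result is simply a read of x29 (x29 is the FP register). Because the builtin is resolved by the compiler rather than by the platform, it works on every version from Android 5 to Android 11.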
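As a small illustration of the builtin, the sketch below compares the frame address of a function with that of its callee. It assumes GCC or Clang and a downward-growing stack; the function names are made up for this example.

```cpp
#include <cstdint>

// Returns the frame address of this function itself (the value of x29 on AArch64).
__attribute__((noinline)) uintptr_t callee_frame() {
  return reinterpret_cast<uintptr_t>(__builtin_frame_address(0));
}

// Returns true if the callee's frame lies at a lower address than the caller's,
// which is what we expect when the stack grows toward lower addresses.
__attribute__((noinline)) bool callee_frame_is_below_caller() {
  const uintptr_t mine = reinterpret_cast<uintptr_t>(__builtin_frame_address(0));
  const uintptr_t callee = callee_frame();
  // The comparison happens after the call, so the call cannot become a tail call.
  return callee != 0 && callee < mine;
}
```

This ordering is what the `frame->next_frame < begin + sizeof(frame_record)` check in the unwinder relies on: each caller's record must sit above the callee's.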

case Builtin::BI__builtin_frame_address: {
  Value *Depth = ConstantEmitter(*this).emitAbstract(E->getArg(0),
                                                 getContext().UnsignedIntTy);
  Function *F = CGM.getIntrinsic(Intrinsic::frameaddress, AllocaInt8PtrTy);
  return RValue::get(Builder.CreateCall(F, Depth));
}

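Next comes end (the stack bottom address, also called stack_top). This is where the main thread and other threads are handled differently, so it is covered in detail below.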

1. How the main thread's stack differs from other threads' stacks

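From the system's point of view, a thread can be created directly with the clone system call. That is fairly low level, though, and ordinary developers mostly use the standard pthread API. Every Java thread that the ART virtual machine creates is also backed by a pthread.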

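With pthread, the child process already has a main thread right after the parent forks, but at that point the thread state is still the parent's. Only after the child has executed exec is a brand-new main thread set up. exec starts the linker, which eventually creates the pthread_internal_t object for the main thread through __libc_init_main_thread_early, whereas every other thread is created through pthread_create.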

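After exec, the main thread's stack is created by the kernel and can grow dynamically. The kernel first gives the main thread 128K of stack and lets it expand, according to actual use, up to RLIMIT_STACK. The details are described as follows.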

The main thread is very special. Its stack was created by kernel and the original size is 128k.
It's automatically expanded if kernel detect it's trying to access the valid stack range.

For generic memory map range, if kernel detect the access is not in this range, the segv will happen.
But for stack memory mapping (map flag with GROWDOWN/GROWUP), if the access is not in the range,
kernel will check whether the mapping could be extended to have the address in the range (mapping
end address - RLIMIT_STACK if GROWDOWN. Or mapping start + RLIMIT_STACK if GROWUP).

The segv just happens if the address is not in the expended stack range.
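The growth window described above can be sketched in a few lines. The snippet reads the soft RLIMIT_STACK and applies a deliberately simplified version of the GROWDOWN check. The helper names are hypothetical, and the real kernel logic involves more conditions than this.

```cpp
#include <sys/resource.h>
#include <cstdint>

// Returns the soft RLIMIT_STACK in bytes, 8 MiB if the limit is unlimited
// (the same fallback bionic uses later in this article), or 0 on failure.
uint64_t main_stack_growth_limit() {
  struct rlimit lim {};
  if (getrlimit(RLIMIT_STACK, &lim) != 0) return 0;
  if (lim.rlim_cur == RLIM_INFINITY) return 8ull * 1024 * 1024;
  return static_cast<uint64_t>(lim.rlim_cur);
}

// Simplified model of the GROWDOWN rule: an access is acceptable if it falls in
// [vma_end - limit, vma_end), even when the mapping has not grown that far yet.
bool within_growth_window(uint64_t vma_end, uint64_t addr, uint64_t limit) {
  const uint64_t lowest = limit > vma_end ? 0 : vma_end - limit;
  return addr >= lowest && addr < vma_end;
}
```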

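For other threads, the stack is requested with mmap when the thread is created, and it is fixed from then on and cannot grow. When the caller has not set the pthread's stack_base, pthread_create ends up in __allocate_thread_mapping to create the stack, and the code below shows that the allocation is indeed done with mmap.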

ThreadMapping __allocate_thread_mapping(size_t stack_size, size_t stack_guard_size) {
  const StaticTlsLayout& layout = __libc_shared_globals()->static_tls_layout;

  // Allocate in order: stack guard, stack, static TLS, guard page.
  size_t mmap_size;
  if (__builtin_add_overflow(stack_size, stack_guard_size, &mmap_size)) return {};
  if (__builtin_add_overflow(mmap_size, layout.size(), &mmap_size)) return {};
  if (__builtin_add_overflow(mmap_size, PTHREAD_GUARD_SIZE, &mmap_size)) return {};

  // Align the result to a page size.
  const size_t unaligned_size = mmap_size;
  mmap_size = __BIONIC_ALIGN(mmap_size, PAGE_SIZE);
  if (mmap_size < unaligned_size) return {};

  // Create a new private anonymous map. Make the entire mapping PROT_NONE, then carve out a
  // read+write area in the middle.
  const int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE;
  char* const space = static_cast<char*>(mmap(nullptr, mmap_size, PROT_NONE, flags, -1, 0));
  if (space == MAP_FAILED) {
    async_safe_format_log(ANDROID_LOG_WARN,
                          "libc",
                          "pthread_create failed: couldn't allocate %zu-bytes mapped space: %s",
                          mmap_size, strerror(errno));
    return {};
  }
  const size_t writable_size = mmap_size - stack_guard_size - PTHREAD_GUARD_SIZE;
  if (mprotect(space + stack_guard_size,
               writable_size,
               PROT_READ | PROT_WRITE) != 0) {
    async_safe_format_log(ANDROID_LOG_WARN, "libc",
                          "pthread_create failed: couldn't mprotect R+W %zu-byte thread mapping region: %s",
                          writable_size, strerror(errno));
    munmap(space, mmap_size);
    return {};
  }

  ThreadMapping result = {};
  result.mmap_base = space;
  result.mmap_size = mmap_size;
  result.mmap_base_unguarded = space + stack_guard_size;
  result.mmap_size_unguarded = mmap_size - stack_guard_size - PTHREAD_GUARD_SIZE;
  result.static_tls = space + mmap_size - PTHREAD_GUARD_SIZE - layout.size();
  result.stack_base = space;
  result.stack_top = result.static_tls;
  return result;
}
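The offset arithmetic in the function above is easier to follow with concrete numbers. The sketch below recomputes the same layout as plain offsets from the start of the mapping. MappingLayout and compute_layout are hypothetical names, tail_guard stands in for PTHREAD_GUARD_SIZE, and the overflow checks of the original are left out for brevity.

```cpp
#include <cstddef>

// Offsets inside one thread mapping, measured from the mmap base.
struct MappingLayout {
  size_t mmap_size;       // total size requested from mmap, page aligned
  size_t writable_begin;  // where the read+write region starts (after the stack guard)
  size_t writable_size;   // size of the read+write region
  size_t stack_top;       // offset of stack_top, which is also where static TLS starts
};

// Order inside the mapping: stack guard, stack, static TLS, trailing guard.
MappingLayout compute_layout(size_t stack_size, size_t stack_guard, size_t tls_size,
                             size_t tail_guard, size_t page_size) {
  size_t total = stack_size + stack_guard + tls_size + tail_guard;
  total = (total + page_size - 1) / page_size * page_size;  // round up to a page
  MappingLayout layout;
  layout.mmap_size = total;
  layout.writable_begin = stack_guard;
  layout.writable_size = total - stack_guard - tail_guard;
  layout.stack_top = total - tail_guard - tls_size;
  return layout;
}
```

For example, with a 64 KiB stack, 4 KiB guards, 100 bytes of static TLS and 4 KiB pages, the mapping is 77824 bytes and stack_top sits at offset 73628.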

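So the first difference between the main thread and other threads is how the stack is created, which also determines whether it can grow dynamically. The growth is easy to observe in the memory maps. The vma backing the main thread's stack is named "[stack]", and its size keeps changing as the stack grows. The vma backing another thread's stack has a name like "anon:stack_and_tls:2082", where 2082 is the thread's tid, and its size never changes after creation.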

Tombstone 1:
0000006d'94ba9000-0000006d'94ca4fff rw-         0     fc000  [anon:stack_and_tls:14390]
0000007f'd9047000-0000007f'd9845fff rw-         0    7ff000  [stack]

Tombstone 2:
00000079'19484000-00000079'1957ffff rw-         0     fc000  [anon:stack_and_tls:15314]
0000007f'ef932000-0000007f'ef952fff rw-         0     21000  [stack] (the extra page is probably a guard page)
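The same ranges can be read programmatically. The sketch below parses a single line in the /proc/<pid>/maps format (not the tombstone format shown above, whose end address is inclusive) and extracts the range of a mapping with a given name. VmaRange and parse_named_vma are illustrative names.

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <cstring>

struct VmaRange {
  uint64_t lo;  // first address of the mapping
  uint64_t hi;  // one past the last address of the mapping
};

// Parses one maps line. Returns true and fills `out` if the line ends with
// `name` (for example "[stack]"); returns false otherwise.
bool parse_named_vma(const char* line, const char* name, VmaRange* out) {
  size_t len = strlen(line);
  while (len > 0 && (line[len - 1] == '\n' || line[len - 1] == ' ')) --len;
  const size_t name_len = strlen(name);
  if (len < name_len || strncmp(line + len - name_len, name, name_len) != 0) return false;
  uint64_t lo = 0;
  uint64_t hi = 0;
  if (sscanf(line, "%" SCNx64 "-%" SCNx64, &lo, &hi) != 2) return false;
  out->lo = lo;
  out->hi = hi;
  return true;
}
```

Sampling the "[stack]" line at two points in time would show lo moving downward while hi stays put.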

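Beyond that, there is one more difference between the main thread and other threads, and it is the real reason why stack_top is obtained differently.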

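When the main thread starts, __libc_init_main_thread_early and __libc_init_main_thread_late are called in turn to initialize its pthread object. In __libc_init_main_thread_late, the stack size attribute of the main thread's pthread is set to 0, and the stack base attribute is not set at all (it defaults to 0). As a result, the attributes stored for the main thread cannot give us its stack address and stack size through pthread_attr_getstack; as the comment in the code says, bionic computes them only when asked.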

__BIONIC_WEAK_FOR_NATIVE_BRIDGE
extern "C" void __libc_init_main_thread_late() {
  __init_bionic_tls_ptrs(__get_bionic_tcb(), __allocate_temp_bionic_tls());

  // Tell the kernel to clear our tid field when we exit, so we're like any other pthread.
  // For threads created by pthread_create, this setup happens during the clone syscall (i.e.
  // CLONE_CHILD_CLEARTID).
  __set_tid_address(&main_thread.tid);

  pthread_attr_init(&main_thread.attr);
  // We don't want to explicitly set the main thread's scheduler attributes (http://b/68328561).
  pthread_attr_setinheritsched(&main_thread.attr, PTHREAD_INHERIT_SCHED);
  // The main thread has no guard page.
  pthread_attr_setguardsize(&main_thread.attr, 0);
  // User code should never see this; we'll compute it when asked.
  pthread_attr_setstacksize(&main_thread.attr, 0);

  // The TLS stack guard is set from the global, so ensure that we've initialized the global
  // before we initialize the TLS. Dynamic executables will initialize their copy of the global
  // stack protector from the one in the main thread's TLS.
  __libc_safe_arc4random_buf(&__stack_chk_guard, sizeof(__stack_chk_guard));
  __init_tcb_stack_guard(__get_bionic_tcb());

  __init_thread(&main_thread);

  __init_additional_stacks(&main_thread);
}

For other threads, both the stack_base and stack_size fields of attr are filled in once the thread stack has been created, so their stack_top (stack_base + stack_size) can be obtained through pthread_attr_getstack.

static int __allocate_thread(pthread_attr_t* attr, bionic_tcb** tcbp, void** child_stack) {
  ThreadMapping mapping;
  char* stack_top;
  bool stack_clean = false;

  if (attr->stack_base == nullptr) {
    // The caller didn't provide a stack, so allocate one.

    // Make sure the guard size is a multiple of PAGE_SIZE.
    const size_t unaligned_guard_size = attr->guard_size;
    attr->guard_size = __BIONIC_ALIGN(attr->guard_size, PAGE_SIZE);
    if (attr->guard_size < unaligned_guard_size) return EAGAIN;

    mapping = __allocate_thread_mapping(attr->stack_size, attr->guard_size);
    if (mapping.mmap_base == nullptr) return EAGAIN;

    stack_top = mapping.stack_top;
    attr->stack_base = mapping.stack_base;
    stack_clean = true;
  } else {
    mapping = __allocate_thread_mapping(0, PTHREAD_GUARD_SIZE);
    if (mapping.mmap_base == nullptr) return EAGAIN;

    stack_top = static_cast<char*>(attr->stack_base) + attr->stack_size;
  }
  ...
  attr->stack_size = stack_top - static_cast<char*>(attr->stack_base);
  ...
}
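To check this from user code, the following sketch starts a worker thread, queries its own stack range, and verifies that its current frame lies inside [stack_base, stack_base + stack_size). It assumes a libc that provides pthread_getattr_np (bionic and glibc both do); the helper names are made up for this example.

```cpp
#include <pthread.h>
#include <cstddef>
#include <cstdint>

struct StackProbe {
  bool ok;  // set to true if the worker's frame lies inside its reported stack
};

// Thread body: asks pthread for this thread's stack range and tests the
// current frame address against it.
static void* probe_own_stack(void* arg) {
  auto* probe = static_cast<StackProbe*>(arg);
  probe->ok = false;
  pthread_attr_t attr;
  if (pthread_getattr_np(pthread_self(), &attr) != 0) return nullptr;
  void* base = nullptr;
  size_t size = 0;
  const int rc = pthread_attr_getstack(&attr, &base, &size);
  pthread_attr_destroy(&attr);
  if (rc != 0) return nullptr;
  const uintptr_t lo = reinterpret_cast<uintptr_t>(base);
  const uintptr_t hi = lo + size;  // this is stack_top
  const uintptr_t here = reinterpret_cast<uintptr_t>(__builtin_frame_address(0));
  probe->ok = lo <= here && here < hi;
  return nullptr;
}

// Runs the probe on a freshly created thread and reports the result.
bool worker_stack_contains_its_frame() {
  StackProbe probe{false};
  pthread_t thread;
  if (pthread_create(&thread, nullptr, probe_own_stack, &probe) != 0) return false;
  pthread_join(thread, nullptr);
  return probe.ok;
}
```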

The figure below shows the stack details for the main thread and for other threads.

[Figure: pthread stack layout (pthread栈结构.png)]

2. How Android (≤10) obtains the stack bottom address

For other threads, pthread_attr_getstack returns stack_base and stack_size, and adding the two gives stack_top. For the main thread, things are more complicated.

Because the main thread's stack_base and stack_size are both set to 0, the stored attributes are of no use here.

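In Android 6 and earlier, stack_base and stack_size are derived by reading the memory maps and locating the vma named "[stack]". One point needs attention: because the main thread's stack grows dynamically, the lower boundary of the vma keeps changing (the stack grows toward lower addresses, so it is the low address that moves, not the high one). Engineers from Intel and Google actually debated this point. To remove the ambiguity, the solution they settled on was to report a fixed size for the main thread's stack, RLIMIT_STACK, which is the limit the stack can grow to. For the CLs themselves, see these two links: 1 and 2.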

static int __pthread_attr_getstack_main_thread(void** stack_base, size_t* stack_size) {
  ErrnoRestorer errno_restorer;

  rlimit stack_limit;
  if (getrlimit(RLIMIT_STACK, &stack_limit) == -1) {
    return errno;
  }

  // If the current RLIMIT_STACK is RLIM_INFINITY, only admit to an 8MiB stack for sanity's sake.
  if (stack_limit.rlim_cur == RLIM_INFINITY) {
    stack_limit.rlim_cur = 8 * 1024 * 1024;
  }

  // It shouldn't matter which thread we are because we're just looking for "[stack]", but
  // valgrind seems to mess with the stack enough that the kernel will report "[stack:pid]"
  // instead if you look in /proc/self/maps, so we need to look in /proc/pid/task/pid/maps.
  char path[64];
  snprintf(path, sizeof(path), "/proc/self/task/%d/maps", getpid());
  FILE* fp = fopen(path, "re");
  if (fp == NULL) {
    return errno;
  }
  char line[BUFSIZ];
  while (fgets(line, sizeof(line), fp) != NULL) {
    if (ends_with(line, " [stack]\n")) {
      uintptr_t lo, hi;
      if (sscanf(line, "%" SCNxPTR "-%" SCNxPTR, &lo, &hi) == 2) {
        *stack_size = stack_limit.rlim_cur;
        *stack_base = reinterpret_cast<void*>(hi - *stack_size);
        fclose(fp);
        return 0;
      }
    }
  }
  __libc_fatal("No [stack] line found in \"%s\"!", path);
}

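This approach was dropped starting with Android 7, however. The reason is that on x86/x86_64 the "[stack]" label in the memory maps can end up on the wrong mapping in some situations; see the comment below. If we first read stack_start from the "/proc/self/stat" node and then search for the vma that contains stack_start, the problem is avoided, and that is the approach Android 7 adopted. If we only care about AArch64, though, the memory maps approach still works fine.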

// The previous code obtained the main thread's stack by reading the entry in
// /proc/self/task/<pid>/maps that was labeled [stack]. Unfortunately, on x86/x86_64, the kernel
// relies on sp0 in task state segment(tss) to label the stack map with [stack]. If the kernel
// switches a process while the main thread is in an alternate stack, then the kernel will label
// the wrong map with [stack]. This test verifies that when the above situation happens, the main
// thread's stack is found correctly.
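The first half of the newer approach, reading the stack start address from /proc/self/stat, can be sketched as a small parser. According to proc(5), startstack is field 28 of that file. The function name parse_startstack is hypothetical, and the sketch works on a line that has already been read into memory.

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Extracts field 28 (startstack) from the contents of /proc/<pid>/stat.
// Returns 0 if the line is malformed or too short.
uintptr_t parse_startstack(const char* stat_line) {
  // Field 2 (comm) is wrapped in parentheses and may itself contain spaces or
  // ')', so resume parsing after the last ')'.
  const char* p = strrchr(stat_line, ')');
  if (p == nullptr) return 0;
  ++p;
  // The first token after comm is field 3 (state).
  for (int field = 3;; ++field) {
    while (*p == ' ') ++p;
    if (*p == '\0' || *p == '\n') return 0;
    if (field == 28) {
      char* end = nullptr;
      const unsigned long long value = strtoull(p, &end, 10);
      if (end == p) return 0;
      return static_cast<uintptr_t>(value);
    }
    while (*p != ' ' && *p != '\0' && *p != '\n') ++p;  // skip this token
  }
}
```

The second half is then a scan of the maps file for the one mapping whose range contains the returned address, which no longer depends on the "[stack]" label.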

3. How Android 11 obtains the stack bottom address

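Every thread, main or not, has its own pthread_internal_t object. If stack_top were stored in that object when the thread is initialized, later use would be much more convenient. That is exactly what Android 11 does.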

The pthread_internal_t class gains a new field, stack_top. But how is it assigned?

Other threads are easy: their stack is mmapped during pthread_create, so the value of stack_top is readily available. But what about the main thread?

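Remember that exec starts the linker? The linker's entry function calls __libc_init_main_thread_early to initialize the main thread's pthread_internal_t object. The call receives the parameter args, which is actually passed up from the kernel and carries the stack_top information.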

extern "C" ElfW(Addr) __linker_init(void* raw_args) {
  // Initialize TLS early so system calls and errno work.
  KernelArgumentBlock args(raw_args);
  bionic_tcb temp_tcb __attribute__((uninitialized));
  linker_memclr(&temp_tcb, sizeof(temp_tcb));
  __libc_init_main_thread_early(args, &temp_tcb);
  ...
}

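So in the new implementation, stack_top is assigned when the main thread's pthread_internal_t is initialized. As for dynamic growth, what is recorded here is the stack bottom address, which does not change no matter how far the stack grows. For that matter, the memory maps implementation never took stack growth into account either.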

main_thread.stack_top = reinterpret_cast<uintptr_t>(args.argv);
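Why argv is a reasonable stand-in for the stack bottom becomes clearer from the layout of the block the kernel leaves on the new stack: argc first, then the argv pointers, a null terminator, and then the envp pointers. The sketch below decodes such a block from raw words. ArgBlock and parse_arg_block are hypothetical names that only mimic what KernelArgumentBlock does with raw_args.

```cpp
#include <cstdint>

struct ArgBlock {
  int argc;
  char** argv;
  char** envp;
};

// Decodes the start of the kernel argument block:
//   words[0]        argc
//   words[1..argc]  argv pointers, followed by a null terminator
//   after that      envp pointers, followed by a null terminator
ArgBlock parse_arg_block(void* raw_args) {
  uintptr_t* words = static_cast<uintptr_t*>(raw_args);
  ArgBlock block;
  block.argc = static_cast<int>(words[0]);
  block.argv = reinterpret_cast<char**>(words + 1);
  block.envp = block.argv + block.argc + 1;
  return block;
}

// Mirrors the assignment above: the address of the argv array is used as stack_top.
uintptr_t stack_top_from(const ArgBlock& block) {
  return reinterpret_cast<uintptr_t>(block.argv);
}
```

Since this block is written before any user code runs, every frame the main thread ever pushes lies below the address of argv, which is all the frame pointer walk needs from end.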

4. How to implement a solution with good compatibility

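Overall, pthread in bionic is implemented differently across Android versions to serve the need for stack_top. If we only consider AArch64, though, the following works from Android 5 all the way to Android 10, for the main thread and for any other thread alike.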

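#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

// Returns the highest address of the calling thread's stack, or 0 on failure.
// 0 is used instead of -1 so that a failed lookup can never pass as a huge upper bound.
uintptr_t get_stack_top() {
    pthread_attr_t attr;
    if (pthread_getattr_np(pthread_self(), &attr) != 0) return 0;
    void* stack_base = NULL;
    size_t stack_size = 0;
    int rc = pthread_attr_getstack(&attr, &stack_base, &stack_size);
    pthread_attr_destroy(&attr);  // release whatever pthread_getattr_np set up
    if (rc != 0) return 0;
    return (uintptr_t)stack_base + stack_size;
}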

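From Android 11 on things get simpler: bionic itself reads the value with a single statement, as the function at the top of this article does. Note that pthread_internal_t is private to bionic, so code outside libc cannot access this field directly.

uintptr_t stack_top = __get_thread()->stack_top;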

As an aside, the special handling of the main thread in xUnwind was presumably modeled on bionic as well.