[macOS Translation] Injecting into 32-bit Programs on macOS Mojave

Transcoded by SimpRead; original article at rpis.ec

March 01, 2020 | Misc, Mac32

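32-bit programs on macOS Mojave are probably the most obscure configuration of Mac software. Because of various changes in Mojave, earlier resources on injecting into 32-bit programs no longer work. There are already posts on injecting into 64-bit programs, but the 32-bit resources have not been updated. This post details our work writing a library injection tool for 32-bit applications on macOS Mojave.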

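The hard parts of injecting into a process on macOS are (in order of execution):

  1. Finding the pid of the target process
  2. Getting the Mach task port of the target process
  3. Creating remote memory regions for the shellcode and the stack
  4. Spawning a remote thread with the code
  5. Running shellcode that calls dlopen and invokes the entry point function
  6. Cleaning up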

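Finding the Target

macOS has a convenient API for finding running applications by bundle identifier. This lets us determine the target's pid automatically instead of looking it up by hand: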

#import <AppKit/AppKit.h>

int main(int argc, const char *argv[]) {
    @autoreleasepool {
        // Expect the target's bundle identifier as the first argument
        if (argc < 2) {
            fprintf(stderr, "Usage: %s <bundle identifier>\n", argv[0]);
            return 1;
        }
        NSArray *apps = [NSRunningApplication runningApplicationsWithBundleIdentifier:[NSString stringWithUTF8String:argv[1]]];
        if (apps.count == 0) {
            fprintf(stderr, "Cannot find running application\n");
            return 1;
        }
        NSRunningApplication *app = (NSRunningApplication *)apps[0];
        pid_t pid = app.processIdentifier;

        // ...
    }
}

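Getting the Task Port

Thankfully, this part has already been worked out for us. The technique we use is based on Scott Knight's blog post on injecting into 64-bit processes: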

task_t remoteTask;
mach_error_t kr = 0;
kr = task_for_pid(mach_task_self(), pid, &remoteTask);
if (kr != KERN_SUCCESS) return -1;

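This requires either running the injector as root or signing the target program with the com.apple.security.get-task-allow entitlement. Even then, because of SIP, we cannot inject into Apple-protected processes such as Finder.app or Dock.app, even as root. Processes compiled with the "Hardened Runtime" may also be a problem; we have not tested those.

In our case we have no control over the target program, so we run the injector as root.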

Allocating Remote Memory

Knight's post has already done this work for us as well.

mach_vm_address_t remoteStack = (vm_address_t) NULL;
mach_vm_address_t remoteCode = (vm_address_t) NULL;

kr = mach_vm_allocate(remoteTask, &remoteStack, STACK_SIZE, VM_FLAGS_ANYWHERE);
if (kr != KERN_SUCCESS) return -2;

kr = mach_vm_allocate(remoteTask, &remoteCode, CODE_SIZE, VM_FLAGS_ANYWHERE);
if (kr != KERN_SUCCESS) return -2;

Spawning a Remote Thread

Again we follow Knight, but replace all the 64-bit structures with their 32-bit counterparts.

First, set up the memory:

char shellcode[] = { ... }; // See below

// Write shellcode into the binary
kr = mach_vm_write(remoteTask, remoteCode, (vm_address_t) shellcode, sizeof(shellcode));
if (kr != KERN_SUCCESS) return -3;

// Mark code as rwx and stack as rw
kr  = vm_protect(remoteTask, remoteCode, CODE_SIZE, FALSE, VM_PROT_READ | VM_PROT_WRITE | VM_PROT_EXECUTE);
if (kr != KERN_SUCCESS) return -4;

kr  = vm_protect(remoteTask, remoteStack, STACK_SIZE, TRUE, VM_PROT_READ | VM_PROT_WRITE);
if (kr != KERN_SUCCESS) return -4;

Then set up the thread state:

x86_thread_state32_t remoteThreadState;
bzero(&remoteThreadState, sizeof(remoteThreadState));

// Make space because the stack grows down
remoteStack += (STACK_SIZE / 2);

remoteThreadState.__eip = (u_int32_t) remoteCode;
remoteThreadState.__esp = (u_int32_t) remoteStack;
remoteThreadState.__ebp = (u_int32_t) remoteStack;

Then start the thread:

thread_act_t remoteThread;

kr = thread_create_running(remoteTask, x86_THREAD_STATE32, (thread_state_t)&remoteThreadState, x86_THREAD_STATE32_COUNT, &remoteThread);
if (kr != KERN_SUCCESS) return -5;

That is everything we need to create a remote thread with its own stack and code.
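The stack pointer arithmetic above is easy to get wrong, so here is a minimal Python sketch of it. The allocation base and the 64 KiB STACK_SIZE are hypothetical (the post does not give the real value), and initial_stack_pointer is an illustrative helper, not part of the injector.

```python
# Hypothetical values for illustration; the post does not state STACK_SIZE.
STACK_SIZE = 0x10000
stack_base = 0x0B000000  # pretend result of mach_vm_allocate


def initial_stack_pointer(base: int, size: int) -> int:
    """Return the midpoint of the region, mirroring `remoteStack += STACK_SIZE / 2`.

    x86 stacks grow toward lower addresses, so starting in the middle leaves
    half the region below esp for pushes and calls, and half above it as slack.
    """
    return base + size // 2


esp = initial_stack_pointer(stack_base, STACK_SIZE)
room_below = esp - stack_base
room_above = stack_base + STACK_SIZE - esp
print(hex(esp), hex(room_below), hex(room_above))
```

Starting esp at the very top of the region would also be valid in principle; the midpoint simply keeps accesses on either side of esp inside the allocation.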

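Shellcode

At this point we have to write the code that the injected thread will run. Since we cannot simply inject a library, this code has to be written as shellcode. Ideally, it calls dlopen on a user-specified library path.

Because of XNU's dual-kernel nature, a thread created with thread_create_running is broken: it exists only in the Mach kernel, with no corresponding thread in the BSD kernel. As a result, the process crashes if that thread makes most system calls. Before macOS Mojave you could call _pthread_set_self(NULL) on the thread (or __pthread_set_self(NULL) before 10.12) to restore this functionality, but that is no longer possible. Instead, as Knight discovered, passing NULL to _pthread_set_self now simply crashes the process. So, like him, we use pthread_create_from_mach_thread to create a new, unbroken thread for our real payload.

This splits our shellcode into two payloads:

  1. Initial code that only calls pthread_create_from_mach_thread from the broken thread
  2. Second-stage code that loads our library with dlopen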

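Shellcode: Stage 1

To stay sane while writing shellcode for a broken thread in the injected process, we chose not to write the assembly by hand and instead used the Shellcode Compiler (scc) included with Binary Ninja (shameless plug). It lets us write C code and compiles it to shellcode automatically, with no hand-written x86.

There are a few key tricks to writing this payload:

  • Mark function types as __stdcall, because scc uses a different convention by default.
  • Assign external function pointers to placeholder values. In the injector, we replace these with the real pointers.
  • We cannot terminate this thread ourselves because it is too broken. So we loop forever until another thread can kill it later.
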
// Function types need to be marked with __stdcall or else scc
// will not use the right calling convention
typedef void *(__stdcall *pthread_start_t)(void *);
typedef int (__stdcall *pthread_create_from_mach_thread_t)(pthread_t *thread, const pthread_attr_t *attr, pthread_start_t start_routine, void *arg);

// External function pointers marked with placeholders
pthread_create_from_mach_thread_t pthread_create_from_mach_thread = (pthread_create_from_mach_thread_t)0x41414141;
pthread_start_t start_thread = (pthread_start_t)0x42424242;

int main() {
    pthread_t thread;
    int ret = pthread_create_from_mach_thread(&thread, NULL, start_thread, NULL);

    while (ret == 0) {
        // Wait for death
    }

    // If we get here pthread_create failed and the process is going extremely down
    __breakpoint();
}

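Shellcode: Stage 2

Once we have a Real pthread (TM), we can start calling functions. Confusingly, dlopen crashes if the thread that calls it does basically anything else, so we spawn a new pthread that only calls dlopen. That seems to satisfy it. Also, now that we are in a real pthread, we can use dlsym to resolve functions instead of resolving them ahead of time.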

typedef void *(__stdcall *dlopen_t)(const char* path, int mode);
dlopen_t dlopen = (dlopen_t)0x43434343;

typedef void *(__stdcall *dlsym_t)(void *handle, const char *symbol);
dlsym_t dlsym = (dlsym_t)0x44444444;

const char *path = (const char*)0x30303030;
int mode = 2;
void *thread_fn(void *arg) {
    // If you do anything else, this thread will die in dlopen
    // (Even if you do it *after* dlopen!)
    return dlopen(path, mode);
}

typedef int (__stdcall *pthread_create_t)(pthread_t *thread, const pthread_attr_t *attr, void *(*start_routine)(void *), void *arg);
typedef int (__stdcall *pthread_join_t)(pthread_t thread, void **value_ptr);
typedef int (__stdcall *pthread_detach_t)(pthread_t thread);
#define RTLD_DEFAULT -2

struct params_t {
    void *shellcode;
    void *user_info;
};

struct params_t params;

int main() {
    // Stage 2 main
    pthread_create_t pthread_create = (pthread_create_t)dlsym((void*)RTLD_DEFAULT, "pthread_create");
    pthread_join_t pthread_join = (pthread_join_t)dlsym((void*)RTLD_DEFAULT, "pthread_join");
    pthread_detach_t pthread_detach = (pthread_detach_t)dlsym((void*)RTLD_DEFAULT, "pthread_detach");


    // We need to open stuff, but calling dlopen is really sketchy. So do it in another thread
    // whose sole job is to call dlopen and get outta there as fast as possible.
    // Also it returns the pointer to the module so we can dlsym() the entry point.
    pthread_t thread;
    int ret = pthread_create(&thread, NULL, thread_fn, NULL);
    void *lib;
    pthread_join(thread, &lib);

    // ...
}

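We can then call dlsym on the result of dlopen to find the entry point of the injection library. Since the injection library is responsible for cleaning up the broken injection threads, we spawn its entry point in yet another thread and wait for it to clean up the stage 2 thread. We also pass it a pointer to somewhere in the shellcode's code segment, so that it can find the two shellcode threads and terminate them.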

    // ...

    // Find and jump to entry point of inject library
    void *(*entry)(void *) = (void*(*)(void *))dlsym(lib, "inj_entry");
    // Pass as argument the code pointer so it can find this thread (and the super janky
    // first thread) and kill us.
    params.shellcode = (void *)0x42424242;
    params.user_info = (void *)0x45454545;
    ret = pthread_create(&thread, NULL, entry, (void *)&params);
    pthread_detach(thread);

    while (ret == 0) {
        // death().await
    }

    // If we get here pthread_create failed and we can't inject
    __breakpoint();
}

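Finally, after all that, we are running ordinary compiled C inside a real library.

Cleaning Up Threads: Stage 3

Once execution starts in the library, we are running in the context of the target process and can easily find its threads. From here, we can iterate over those threads and kill any whose instruction pointer lies in the same page as the shellcode.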

struct params_t {
    void *shellcode;
    void *user_info;
};

void inj_entry(struct params_t *params) {
    printf("Start! %p %p %p\n", params, params->shellcode, params->user_info);

    // The janky threads are going to be somewhere in this page
    // params->shellcode is a pointer to the start address of one of them
    uint32_t hack_page = (uint32_t)params->shellcode & ~(0xFFF);

    // Find all the currently running threads
    printf("Activate! Thread start at %p\n", params->shellcode);
    thread_act_array_t thread_list;
    mach_msg_type_number_t list_count;
    kern_return_t err = task_threads(mach_task_self_, &thread_list, &list_count);
    if (err != KERN_SUCCESS) {
        printf("Could not get threads: %s\n", mach_error_string(err));
        return;
    }

    for (int i = 0; i < list_count; i ++) {
        thread_act_t thread = thread_list[i];
        if (thread == mach_thread_self()) {
            printf("We are thread %d\n", i);
            continue;
        }

        // Find where this thread is at
        x86_thread_state32_t old_state;
        mach_msg_type_number_t state_count = x86_THREAD_STATE32_COUNT;
        err = thread_get_state(thread, x86_THREAD_STATE32, (thread_state_t)&old_state, &state_count);

        if (err != KERN_SUCCESS) {
            printf("Could not get thread info for thread %d: %s\n", i, mach_error_string(err));
            return;
        }

        // If this is one of the janky threads, kill it (gently)
        if ((old_state.__eip & ~(0xFFF)) == hack_page) {
            printf("Janky thread %d eip 0x%08x (start is 0x%08x)\n", i, old_state.__eip, hack_page);
            // If you don't suspend the thread before terminating it, the Finder will Find you in your sleep
            thread_suspend(thread);
            thread_terminate(thread);
        }
    }

    // Call user code

    const char *lib_path = (const char *)params->user_info;
    const char *lib_fn = lib_path + strlen(lib_path) + 1;

    printf("Loading %s\n", lib_path);
    void *lib = dlopen(lib_path, RTLD_NOW);

    printf("Loaded at %p\n", lib);
    void (*fn)(void) = (void(*)(void))dlsym(lib, lib_fn);

    printf("Running %s::%s() at %p\n", lib_path, lib_fn, fn);
    if (fn != NULL) {
        fn();
    }
}
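The page test that inj_entry uses to recognise the shellcode threads can be tried on its own. The following Python sketch mirrors the `& ~(0xFFF)` comparison; page_of and is_shellcode_thread are hypothetical helper names, and the addresses are made up for illustration.

```python
PAGE_MASK = 0xFFF  # 4 KiB pages, as in the `& ~(0xFFF)` expressions above


def page_of(addr: int) -> int:
    """Clear the low 12 bits, keeping a 32-bit result like the C code."""
    return addr & ~PAGE_MASK & 0xFFFFFFFF


def is_shellcode_thread(eip: int, shellcode_ptr: int) -> bool:
    """True when eip lies in the same 4 KiB page as the shellcode pointer."""
    return page_of(eip) == page_of(shellcode_ptr)


# Hypothetical addresses for illustration only
shellcode_ptr = 0x0A001040  # stage 2 start inside the code page
print(is_shellcode_thread(0x0A001012, shellcode_ptr))  # thread spinning in stage 1
print(is_shellcode_thread(0x9F3C2A10, shellcode_ptr))  # ordinary application thread
```

One consequence of this test is that it covers a single page only: if the combined shellcode ever grew past 4 KiB, a thread spinning in a later page would not match and would need a range check instead.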

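Resolving Functions

Now that we have the payload, we just need to resolve its external functions dynamically before running it. Normally macOS makes this easy through the system libraries and the dyld shared cache, but that approach breaks if we try to run our code from a debugger, because lldb injects its own version of libsystem_pthread.dylib into the process being debugged. So we resolve manually instead. Our resolver is based on Stanislas Lejay's "Playing with Mach-O binaries and dyld", updated to read memory from the target process through its Mach task port.

First, a convenience helper for reading virtual memory from the target process: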

// Read length bytes from vmaddr in task
// Returns a malloc'd buffer the caller must free, or NULL on failure
char *virtual_read(task_t task, mach_vm_address_t vmaddr, mach_vm_size_t length) {
    char *memory = (char *)malloc((size_t) length);
    if (memory == NULL) return NULL;

    mach_vm_size_t outsize = 0;
    kern_return_t ret = mach_vm_read_overwrite(task, vmaddr, length, (mach_vm_address_t)memory, &outsize);

    if (ret != KERN_SUCCESS) {
        // Don't leak the buffer when the remote read fails
        free(memory);
        return NULL;
    }

    return memory;
}

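Then, to start resolving functions, we need the base address of the library in the target process. Given that base address, and again following "Playing with Mach-O binaries and dyld", we look the function up in the library's symbol table: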

static uint32_t find_function32(task_t task, char *base, char *shared_cache_rx_base, const char *fnname)
{
    struct mach_header *base_header = (struct mach_header *)virtual_read(task, (mach_vm_address_t)base, sizeof(struct mach_header));
    uint32_t ncmds = base_header->ncmds;
    free(base_header);
    struct symtab_command *symcmd = NULL;

    mach_vm_address_t start = (mach_vm_address_t)(base + sizeof(struct mach_header));

    // Find the LC_SYMTAB load command
    for (uint32_t i = 0; i < ncmds; ++i) {
        struct segment_command *cmd = (struct segment_command *)virtual_read(task, start, 0x100);
        if (cmd == NULL) return 0;

        if (cmd->cmd == LC_SYMTAB) {
            symcmd = (struct symtab_command*)cmd;
            break;
        }
        start += cmd->cmdsize;
        free(cmd);
    }

    // No symbol table means there is nothing to resolve
    if (symcmd == NULL) return 0;

    // We need to resolve where the symbol/string tables are in the target memory
    mach_vm_address_t strtab_start = 0;
    mach_vm_address_t symtab_start = 0;
    // Also need the base address of the binary (different with cache)
    uint64_t aslr_slide = 0;

    // If this library is in the shared cache then use that instead
    if (base >= shared_cache_rx_base) {
        // "Playing with Mach-O binaries and dyld", but virtual_read
        dyld_cache_header *cache_header = (dyld_cache_header *)virtual_read(task, (mach_vm_address_t)shared_cache_rx_base, sizeof(dyld_cache_header));

        size_t rx_size = 0;
        size_t rw_size = 0;
        size_t rx_addr = 0;
        size_t ro_addr = 0;
        off_t ro_off = 0;

        for (int i = 0; i < cache_header->mappingCount; ++i) {
            shared_file_mapping_np *mapping = (shared_file_mapping_np *)virtual_read(task, (mach_vm_address_t)shared_cache_rx_base + cache_header->mappingOffset + sizeof(shared_file_mapping_np) * i, sizeof(shared_file_mapping_np));

            if (mapping->init_prot & VM_PROT_EXECUTE) {
                // Get size and address of [R-X] mapping
                rx_size = (size_t)mapping->size;
                rx_addr = (size_t)mapping->address;
            } else if (mapping->init_prot & VM_PROT_WRITE) {
                // Get size of [RW-] mapping
                rw_size = (size_t)mapping->size;
            } else if (mapping->init_prot == VM_PROT_READ) {
                // Get file offset of [R--] mapping
                ro_off = (size_t)mapping->file_offset;
                ro_addr = (size_t)mapping->address;
            }

            free(mapping);
        }
        free(cache_header);

        aslr_slide = (uint64_t)shared_cache_rx_base - rx_addr;

        char *shared_cache_ro = (char*)(ro_addr + aslr_slide);
        uint64_t stroff_from_ro = symcmd->stroff - rx_size - rw_size;
        uint64_t symoff_from_ro = symcmd->symoff - rx_size - rw_size;

        strtab_start = (mach_vm_address_t)(shared_cache_ro + stroff_from_ro);
        symtab_start = (mach_vm_address_t)(shared_cache_ro + symoff_from_ro);
    } else {
        // Otherwise just use the base address of the library
        aslr_slide = (uint64_t)base;
        strtab_start = (mach_vm_address_t)base + symcmd->stroff;
        symtab_start = (mach_vm_address_t)base + symcmd->symoff;
    }

    char *strtab = (char *)virtual_read(task, strtab_start, symcmd->strsize);
    struct nlist *symtab = (struct nlist *)virtual_read(task, symtab_start, symcmd->nsyms * sizeof(struct nlist));

    for (uint32_t i = 0; i < symcmd->nsyms; ++i){
        uint32_t strtab_off = symtab[i].n_un.n_strx;
        uint32_t func       = symtab[i].n_value;

        if(strcmp(&strtab[strtab_off], fnname) == 0) {
            free(strtab);
            free(symtab);
            free(symcmd);
            return (uint32_t)func + aslr_slide;
        }
    }

    free(strtab);
    free(symtab);
    free(symcmd);
    return 0;
}
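To make the symbol lookup easier to follow, here is a minimal Python sketch of the same walk (header, LC_SYMTAB, nlist entries, string table) over a local byte buffer rather than remote memory. It corresponds to the non-shared-cache branch, where the tables sit at base + symoff and base + stroff. The fake image built by build_test_image is purely illustrative and is not a loadable Mach-O file.

```python
import struct

MH_MAGIC = 0xFEEDFACE               # 32-bit Mach-O magic
LC_SEGMENT = 0x1
LC_SYMTAB = 0x2
HEADER = struct.Struct("<7I")       # struct mach_header
SYMTAB_CMD = struct.Struct("<6I")   # struct symtab_command
NLIST = struct.Struct("<IBBhI")     # struct nlist (32-bit)


def find_function32(image: bytes, name: bytes, slide: int = 0) -> int:
    """Return slide + n_value for `name`, or 0 if it is not found."""
    magic, _, _, _, ncmds, _, _ = HEADER.unpack_from(image, 0)
    if magic != MH_MAGIC:
        return 0
    off = HEADER.size
    symcmd = None
    for _ in range(ncmds):
        cmd, cmdsize = struct.unpack_from("<2I", image, off)
        if cmd == LC_SYMTAB:
            symcmd = SYMTAB_CMD.unpack_from(image, off)
            break
        off += cmdsize
    if symcmd is None:
        return 0
    _, _, symoff, nsyms, stroff, strsize = symcmd
    strtab = image[stroff:stroff + strsize]
    for i in range(nsyms):
        n_strx, _, _, _, n_value = NLIST.unpack_from(image, symoff + i * NLIST.size)
        end = strtab.index(b"\0", n_strx)
        if strtab[n_strx:end] == name:
            return (slide + n_value) & 0xFFFFFFFF
    return 0


def build_test_image() -> bytes:
    """Assemble a tiny fake image: header, one dummy command, LC_SYMTAB, tables."""
    strtab = b"\0_dlopen\0_dlsym\0"
    syms = [(1, 0x1000), (9, 0x2000)]          # (offset into strtab, n_value)
    dummy = struct.pack("<2I", LC_SEGMENT, 8)  # truncated stand-in, skipped via cmdsize
    symoff = HEADER.size + len(dummy) + SYMTAB_CMD.size
    stroff = symoff + len(syms) * NLIST.size
    header = HEADER.pack(MH_MAGIC, 7, 3, 6, 2, len(dummy) + SYMTAB_CMD.size, 0)
    symtab_cmd = SYMTAB_CMD.pack(LC_SYMTAB, SYMTAB_CMD.size, symoff, len(syms), stroff, len(strtab))
    nlists = b"".join(NLIST.pack(strx, 0x0F, 1, 0, value) for strx, value in syms)
    return header + dummy + symtab_cmd + nlists + strtab


image = build_test_image()
print(hex(find_function32(image, b"_dlsym", slide=0x90000000)))
```

Note that symbol names carry a leading underscore in the string table, which is why the remap table later asks for "_dlopen" rather than "dlopen".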

We can then put these pieces together to find the address of any function in the target process:

uint32_t task_dlsym32(task_t task, pid_t pid, const char *libName, const char *fnName) {
    char *shared_cache_base;
    char *lib_base = find_lib32(task, libName, &shared_cache_base);
    uint32_t fn_guest = find_function32(task, lib_base, shared_cache_base, fnName);

    return (uint32_t)fn_guest;
}
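The shared-cache branch of find_function32 is mostly address arithmetic, which the following Python sketch reproduces with made-up numbers. The Mapping class and symtab_addresses are hypothetical names, the mapping values are not taken from a real cache, and the sketch assumes (as the C code implicitly does) that the R-X, RW- and R-- mappings sit back to back in the cache file.

```python
from dataclasses import dataclass

VM_PROT_READ, VM_PROT_WRITE, VM_PROT_EXECUTE = 0x1, 0x2, 0x4


@dataclass
class Mapping:
    address: int      # unslid address recorded in the cache header
    size: int
    file_offset: int
    init_prot: int


def symtab_addresses(mappings, shared_cache_rx_base, symoff, stroff):
    """Mirror the shared-cache branch: return (slide, symtab_start, strtab_start)."""
    rx = next(m for m in mappings if m.init_prot & VM_PROT_EXECUTE)
    rw = next(m for m in mappings if m.init_prot & VM_PROT_WRITE)
    ro = next(m for m in mappings if m.init_prot == VM_PROT_READ)
    slide = shared_cache_rx_base - rx.address
    ro_base = ro.address + slide
    # symoff/stroff are cache-file offsets; subtracting the R-X and RW- sizes
    # turns them into offsets inside the R-- mapping.
    skip = rx.size + rw.size
    return slide, ro_base + (symoff - skip), ro_base + (stroff - skip)


# Hypothetical cache layout, for illustration only
mappings = [
    Mapping(0x90000000, 0x20000000, 0x00000000, VM_PROT_READ | VM_PROT_EXECUTE),
    Mapping(0xB0000000, 0x04000000, 0x20000000, VM_PROT_READ | VM_PROT_WRITE),
    Mapping(0xB4000000, 0x08000000, 0x24000000, VM_PROT_READ),
]
slide, symtab_start, strtab_start = symtab_addresses(
    mappings, shared_cache_rx_base=0x90123000, symoff=0x24100000, stroff=0x24200000)
print(hex(slide), hex(symtab_start), hex(strtab_start))
```

In this layout the subtraction equals the R-- mapping's file_offset, so using that field directly would likely be an equivalent and slightly more robust choice.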

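Patching the Shellcode

Now that we have resolved the addresses, we need to patch them into the shellcode. We also need to patch in the locations of various strings, such as the library path.

First, we pick an address for the path of the Stage 3 library and an address for the inj_entry parameters, and then write both: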

uint32_t injLibAddr = remoteCode + CODE_SIZE - 0x80;
uint32_t injLibParamsAddr = remoteStack + STACK_SIZE * 3 / 4;

// File path of injection library
kr = mach_vm_write(remoteTask, injLibAddr, (vm_offset_t)injectLib, strlen(injectLib) + 1);
if (kr != KERN_SUCCESS) return -2;

// Parameters: currently <user library path>\0<user library function>\0
kr = mach_vm_write(remoteTask, injLibParamsAddr, (vm_offset_t)lib, strlen(lib) + 1);
if (kr != KERN_SUCCESS) return -2;
kr = mach_vm_write(remoteTask, injLibParamsAddr + strlen(lib) + 1, (vm_offset_t)fn, strlen(fn) + 1);
if (kr != KERN_SUCCESS) return -2;
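The parameter block is just two NUL-terminated strings laid end to end. As a rough illustration, this Python sketch packs such a block and then recovers both strings the way inj_entry does with strlen(); pack_params and unpack_params are hypothetical helpers, and the path and function name are invented.

```python
def pack_params(lib_path: str, fn_name: str) -> bytes:
    """Lay out the library path and function name, each followed by a NUL byte."""
    return lib_path.encode() + b"\0" + fn_name.encode() + b"\0"


def unpack_params(blob: bytes):
    """Recover both strings the way inj_entry does with strlen()."""
    path_end = blob.index(b"\0")           # strlen(lib_path)
    fn_start = path_end + 1                # lib_path + strlen(lib_path) + 1
    fn_end = blob.index(b"\0", fn_start)
    return blob[:path_end].decode(), blob[fn_start:fn_end].decode()


# Hypothetical path and symbol name, for illustration only
blob = pack_params("/tmp/libuser.dylib", "user_main")
print(unpack_params(blob))
```

Because the block lives in the remote stack region, it is worth keeping the combined length well below the space reserved for it.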

Then, a list of remaps pairing each placeholder string in the shellcode with the address that replaces it:

struct remap {
    const char *search;
    uint32_t replace;
    int replace_count;
};

struct remap remaps[] = {
    {"0000", (uint32_t)injLibAddr, 0},
    {"AAAA", (uint32_t)task_dlsym32(remoteTask, pid, "libsystem_pthread.dylib", "_pthread_create_from_mach_thread"), 0},
    {"BBBB", (uint32_t)remoteCode + sc1_length, 0},
    {"CCCC", (uint32_t)task_dlsym32(remoteTask, pid, "libdyld.dylib", "_dlopen"), 0},
    {"DDDD", (uint32_t)task_dlsym32(remoteTask, pid, "libdyld.dylib", "_dlsym"), 0},
    {"EEEE", (uint32_t)injLibParamsAddr, 0},
};

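Notably, this covers:

  • The various external functions we need to call
  • The location of the stage 2 shellcode
  • The location of the path string of the injection library to be dlopen()ed
  • The location of the parameters for inj_entry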
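The search strings line up with the placeholder constants in the shellcode sources because 0x41414141 is simply "AAAA" in ASCII, and so on. The short Python sketch below checks that correspondence, assuming the compiler embeds each constant as a raw 32-bit little-endian value somewhere in its output (the post does not show the generated bytes).

```python
import struct

# Placeholder constants from the two shellcode sources and their remap keys
PLACEHOLDERS = {
    0x30303030: b"0000",  # path of the injection library
    0x41414141: b"AAAA",  # pthread_create_from_mach_thread
    0x42424242: b"BBBB",  # start of the stage 2 shellcode
    0x43434343: b"CCCC",  # dlopen
    0x44444444: b"DDDD",  # dlsym
    0x45454545: b"EEEE",  # parameters for inj_entry
}

for value, key in PLACEHOLDERS.items():
    # x86 stores 32-bit immediates little-endian; these values are byte
    # palindromes, so the byte order would not even matter here.
    assert struct.pack("<I", value) == key

# A real address is written the same way (hypothetical address)
print(struct.pack("<I", 0x0A001040).hex())
```

Choosing repeated-byte placeholders keeps the patterns easy to spot in a hex dump and makes an accidental match against ordinary instruction bytes less likely.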

Then we just walk the shellcode, check every offset against each remap pattern, and replace the bytes where they match:

// Scan every offset that still has 4 bytes left to compare
for (size_t i = 0; i + 4 <= sizeof(shellcode); i++) {
    char *possiblePatchLocation = (char *)shellcode + i;
    for (size_t j = 0; j < sizeof(remaps) / sizeof(*remaps); j++) {
        if (memcmp(possiblePatchLocation, remaps[j].search, 4) == 0) {
            memcpy(possiblePatchLocation, &remaps[j].replace, 4);
            remaps[j].replace_count++;
        }
    }
}
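The same patching pass can be prototyped in a few lines of Python, which is handy for checking a freshly compiled payload before wiring it into the injector. This is a sketch, not the tool's code: patch_shellcode is a hypothetical helper, the three-instruction "shellcode" is hand-assembled for the example, and unlike the C loop it skips past bytes it has just written so a patched address can never be re-matched.

```python
import struct


def patch_shellcode(shellcode: bytes, remaps: dict) -> tuple:
    """Replace each 4-byte placeholder with a little-endian 32-bit address.

    Returns the patched bytes and a per-placeholder replacement count.
    """
    buf = bytearray(shellcode)
    counts = {key: 0 for key in remaps}
    i = 0
    while i + 4 <= len(buf):
        window = bytes(buf[i:i + 4])
        if window in remaps:
            buf[i:i + 4] = struct.pack("<I", remaps[window])
            counts[window] += 1
            i += 4  # skip the bytes just written so they are never re-matched
        else:
            i += 1
    return bytes(buf), counts


# Fake "shellcode": mov eax, 0x41414141 ; call eax ; push 0x42424242
code = b"\xb8AAAA\xff\xd0\x68BBBB"
patched, counts = patch_shellcode(code, {b"AAAA": 0x90001234, b"BBBB": 0x0A001040})
print(patched.hex(), counts)
```

A count of zero for any placeholder after patching is a cheap signal that the compiler did not emit the constant in the expected form, which is presumably what replace_count is for in the C version.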

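Automating Shellcode Builds with scc

While writing this injection framework, we needed to test a lot of shellcode. Conveniently, scc comes with a command-line interface, which can be scripted from Python through subprocess.run():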

import os
import subprocess

# Xcode provides us this environment variable with the root directory of the project
project_dir = os.environ["PROJECT_DIR"]
proc1 = subprocess.run(["scc", "--platform", "mac", "--arch", "x86", project_dir+"/testinj/shellcode.c", "--stdout"], stdout=subprocess.PIPE)
proc2 = subprocess.run(["scc", "--platform", "mac", "--arch", "x86", project_dir+"/testinj/shellcode2.c", "--stdout"], stdout=subprocess.PIPE)

# Pad shellcodes with interrupts, just in case
proc1_output = proc1.stdout
proc1_output = proc1_output.ljust((len(proc1_output) + 0xf) & ~0xf, b'\xcc')
proc2_output = proc2.stdout
proc2_output = proc2_output.ljust((len(proc2_output) + 0xf) & ~0xf, b'\xcc')

From here, we can simply format the shellcode output as a C-style header file that our injector can #include. We also define a few extra variables to help the injection code:

# Combine shellcode into a C-style array
shellcodes = proc1_output + proc2_output
formatted = "unsigned char shellcode[] = {" + ", ".join("0x{:02X}".format(b) for b in shellcodes) + "};\n"

# Page align
code_size = len(shellcodes) + 0x100
code_size = (code_size + 0xfff) & ~0xfff

# Write to a C header file for main.m to include
with open(project_dir + "/testinj/shellcode.h", "w") as f:
    f.write(formatted)
    f.write("uint32_t sc1_length = 0x{:x};\n".format(len(proc1_output)))
    f.write("#define CODE_SIZE 0x{:x}\n".format(code_size))
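The two rounding expressions in the script are worth a quick sanity check. The Python sketch below wraps them in hypothetical helpers, pad_to_16 and page_align, and feeds them stand-in byte strings instead of real scc output.

```python
def pad_to_16(code: bytes) -> bytes:
    """Pad with int3 (0xCC) up to the next multiple of 16 bytes."""
    return code.ljust((len(code) + 0xF) & ~0xF, b"\xcc")


def page_align(size: int) -> int:
    """Round up to a whole number of 4 KiB pages."""
    return (size + 0xFFF) & ~0xFFF


stage1 = pad_to_16(b"\x90" * 21)   # stand-in bytes, not real scc output
stage2 = pad_to_16(b"\x90" * 300)
sc1_length = len(stage1)           # stage 2 starts at remoteCode + sc1_length
code_size = page_align(len(stage1) + len(stage2) + 0x100)
print(len(stage1), len(stage2), hex(code_size))
```

The extra 0x100 bytes presumably leave room for the injection library path, which the injector writes at remoteCode + CODE_SIZE - 0x80.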

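Conclusion

Compared with the mess that is macOS, injecting into a remote process on Windows seems trivial. This process involved shellcode, broken half-threads, and about 10 hours of reading nearly nonexistent Mach documentation. In the end, we have a tool for injecting into 32-bit applications on the last version of macOS that supports them.

References

Source code: GitHub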