Originally published at rpis.ec
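March 01, 2020 | Other, Mac32
32-bit programs on macOS Mojave may be the most obscure configuration in Mac software. Because of the changes in Mojave, earlier resources on injecting into 32-bit programs no longer work. There are already posts about injecting into 64-bit programs, but the 32-bit resources have not been updated. This post details our work writing a library injection tool for 32-bit applications on macOS Mojave.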
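The hard parts of injecting into a process on macOS are (in order of execution):
- Finding the pid of the target process
- Getting the Mach task port of the target process
- Creating remote memory regions for the shellcode and stack
- Spawning a remote thread that runs the code
- Running shellcode that calls dlopen and then calls an entry-point function
- Cleaning up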
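Finding the Target
macOS has a convenient API for looking up processes by bundle identifier. This lets us determine the pid of the target process automatically instead of looking it up by hand: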
int main(int argc, const char *argv[]) {
    @autoreleasepool {
        // The bundle identifier of the target is expected as the first argument
        if (argc < 2) {
            fprintf(stderr, "Usage: %s <bundle identifier>\n", argv[0]);
            return 1;
        }
        NSArray *apps = [NSRunningApplication runningApplicationsWithBundleIdentifier:[NSString stringWithUTF8String:argv[1]]];
        if (apps.count == 0) {
            fprintf(stderr, "Cannot find running application\n");
            return 1;
        }
        NSRunningApplication *app = (NSRunningApplication *)apps[0];
        pid = app.processIdentifier;
        // ...
    }
}
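Getting the Task Port
Luckily, this part has already been done for us. The technique we use is based on Scott Knight's blog post on injecting into 64-bit processes.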
task_t remoteTask;
mach_error_t kr = 0;
kr = task_for_pid(mach_task_self(), pid, &remoteTask);
if (kr != KERN_SUCCESS) return -1;
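This requires either running the injector as root or having the target signed with the com.apple.security.get-task-allow entitlement. Even then, because of SIP, we cannot inject into Apple-protected processes such as Finder.app or Dock.app, even as root. Processes built with the Hardened Runtime may also be a problem; we did not test those.
In our case we have no control over the target program, so we run the injector as root.
Allocating Remote Memory
This, too, has already been done for us in Knight's blog post.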
mach_vm_address_t remoteStack = (vm_address_t) NULL;
mach_vm_address_t remoteCode = (vm_address_t) NULL;
kr = mach_vm_allocate(remoteTask, &remoteStack, STACK_SIZE, VM_FLAGS_ANYWHERE);
if (kr != KERN_SUCCESS) return -2;
kr = mach_vm_allocate(remoteTask, &remoteCode, CODE_SIZE, VM_FLAGS_ANYWHERE);
if (kr != KERN_SUCCESS) return -2;
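Spawning the Remote Thread
Again we follow Knight, but replace all of the 64-bit structures with their 32-bit counterparts.
First, set up the memory: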
char shellcode[] = { ... }; // See below
// Write shellcode into the binary
kr = mach_vm_write(remoteTask, remoteCode, (vm_address_t) shellcode, sizeof(shellcode));
if (kr != KERN_SUCCESS) return -3;
// Mark code as rwx and stack as rw
kr = vm_protect(remoteTask, remoteCode, CODE_SIZE, FALSE, VM_PROT_READ | VM_PROT_WRITE | VM_PROT_EXECUTE);
if (kr != KERN_SUCCESS) return -4;
kr = vm_protect(remoteTask, remoteStack, STACK_SIZE, TRUE, VM_PROT_READ | VM_PROT_WRITE);
if (kr != KERN_SUCCESS) return -4;
Then set up the thread state:
x86_thread_state32_t remoteThreadState;
bzero(&remoteThreadState, sizeof(remoteThreadState));
// Make space because the stack grows down
remoteStack += (STACK_SIZE / 2);
remoteThreadState.__eip = (u_int32_t) remoteCode;
remoteThreadState.__esp = (u_int32_t) remoteStack;
remoteThreadState.__ebp = (u_int32_t) remoteStack;
Then start the thread:
thread_act_t remoteThread;
kr = thread_create_running(remoteTask, x86_THREAD_STATE32, (thread_state_t)&remoteThreadState, x86_THREAD_STATE32_COUNT, &remoteThread);
if (kr != KERN_SUCCESS) return -5;
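That is all we need to create a remote thread with a stack and code.
Shellcode
At this point we have to write the code that the injected thread will run. Since we cannot simply inject a library directly, the code has to be written as shellcode. Ideally, this code calls dlopen on a user-specified library path.
Because XNU pairs a Mach kernel with a BSD layer, creating a thread with thread_create_running yields a broken thread: it exists only on the Mach side and has no counterpart on the BSD side. As a result, the process crashes if this thread makes most system calls. Before macOS Mojave, you could call _pthread_set_self(NULL) on the thread (or __pthread_set_self(NULL) before 10.12) to restore this functionality, but that is no longer possible. Instead, as Knight discovered, _pthread_set_self now simply crashes the process when passed NULL. So, as he did, we use pthread_create_from_mach_thread to create a new, unbroken thread for our real payload.
This splits our shellcode into two payloads:
- The initial code, which only calls pthread_create_from_mach_thread from the broken thread
- The second-stage code, which loads our library with dlopen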
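Shellcode: Stage 1
To stay sane while writing shellcode for the broken thread in the injected process, we chose not to hand-write assembly and instead used the Shellcode Compiler (scc) included with Binary Ninja (shameless plug). It lets us write C code and compiles it to shellcode automatically, with no hand-written x86.
There are a few key tricks to writing this payload:
- Mark function types as __stdcall, since scc uses a different calling convention by default.
- Assign external function pointers to placeholder values. In the injector, we replace these with the real pointers.
- We cannot terminate this thread, because it is too broken. Instead, we loop forever until another thread can kill it later.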
// Function types need to be marked with __stdcall or else scc
// will not use the right calling convention
typedef void *(__stdcall *pthread_start_t)(void *);
typedef int (__stdcall *pthread_create_from_mach_thread_t)(pthread_t *thread, const pthread_attr_t *attr, pthread_start_t start_routine, void *arg);
// External function pointers marked with placeholders
pthread_create_from_mach_thread_t pthread_create_from_mach_thread = (pthread_create_from_mach_thread_t)0x41414141;
pthread_start_t start_thread = (pthread_start_t)0x42424242;
int main() {
    pthread_t thread;
    int ret = pthread_create_from_mach_thread(&thread, NULL, start_thread, NULL);
    while (ret == 0) {
        // Wait for death
    }
    // If we get here pthread_create failed and the process is going extremely down
    __breakpoint();
}
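Shellcode: Stage 2
Once we have created a Real pthread (TM), we can start calling functions. Confusingly, dlopen crashes if the thread calling it does basically anything else, so we spawn a new pthread that only calls dlopen. This seems to satisfy it. Also, now that we are in a real pthread, we can resolve functions with dlsym instead of resolving them ahead of time.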
typedef void *(__stdcall *dlopen_t)(const char* path, int mode);
dlopen_t dlopen = (dlopen_t)0x43434343;
typedef void *(__stdcall *dlsym_t)(void *handle, const char *symbol);
dlsym_t dlsym = (dlsym_t)0x44444444;
const char *path = (const char*)0x30303030;
int mode = 2;
void *thread_fn(void *arg) {
    // If you do anything else, this thread will die in dlopen
    // (Even if you do it *after* dlopen!)
    return dlopen(path, mode);
}
typedef int (__stdcall *pthread_create_t)(pthread_t *thread, const pthread_attr_t *attr, void *(*start_routine)(void *), void *arg);
typedef int (__stdcall *pthread_join_t)(pthread_t thread, void **value_ptr);
typedef int (__stdcall *pthread_detach_t)(pthread_t thread);
#define RTLD_DEFAULT -2
struct params_t {
    void *shellcode;
    void *user_info;
};
struct params_t params;
int main() {
    // Stage 2 main
    pthread_create_t pthread_create = (pthread_create_t)dlsym((void*)RTLD_DEFAULT, "pthread_create");
    pthread_join_t pthread_join = (pthread_join_t)dlsym((void*)RTLD_DEFAULT, "pthread_join");
    pthread_detach_t pthread_detach = (pthread_detach_t)dlsym((void*)RTLD_DEFAULT, "pthread_detach");
    // We need to open stuff, but calling dlopen is really sketchy. So do it in another thread
    // whose sole job is to call dlopen and get outta there as fast as possible.
    // Also it returns the pointer to the module so we can dlsym() the entry point.
    pthread_t thread;
    int ret = pthread_create(&thread, NULL, thread_fn, NULL);
    void *lib;
    pthread_join(thread, &lib);
    // ...
}
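We can then use the results of dlopen and dlsym to find the entry point of the injection library. Since the injection library is responsible for cleaning up the broken injection threads, we spawn its entry point in yet another thread and wait for it to clean up the stage 2 thread. We also pass the entry point a pointer to somewhere in the shellcode's code section, so that it can find both shellcode threads and terminate them.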
    // ...
    // Find and jump to entry point of inject library
    void *(*entry)(void *) = (void*(*)(void *))dlsym(lib, "inj_entry");
    // Pass as argument the code pointer so it can find this thread (and the super janky
    // first thread) and kill us.
    params.shellcode = (void *)0x42424242;
    params.user_info = (void *)0x45454545;
    ret = pthread_create(&thread, NULL, entry, (void *)&params);
    pthread_detach(thread);
    while (ret == 0) {
        // death().await
    }
    // If we get here pthread_create failed and we can't inject
    __breakpoint();
}
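Finally, after all this time, we are executing real C code inside an actual library.
Cleaning Up Threads: Stage 3
Once execution starts in the library, we are running in the context of the target process and can easily find its threads. From here, we can iterate over those threads and kill any whose instruction pointer is on the same page as the shellcode.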
struct params_t {
    void *shellcode;
    void *user_info;
};
void inj_entry(struct params_t *params) {
    printf("Start! %p %p %p\n", params, params->shellcode, params->user_info);
    // The janky threads are going to be somewhere in this page
    // params->shellcode is a pointer to the start address of one of them
    uint32_t hack_page = (uint32_t)params->shellcode & ~(0xFFF);
    // Find all the currently running threads
    printf("Activate! Thread start at %p\n", params->shellcode);
    thread_act_array_t thread_list;
    mach_msg_type_number_t list_count;
    kern_return_t err = task_threads(mach_task_self_, &thread_list, &list_count);
    if (err != KERN_SUCCESS) {
        printf("Could not get threads: %s\n", mach_error_string(err));
        return;
    }
    for (int i = 0; i < list_count; i ++) {
        thread_act_t thread = thread_list[i];
        if (thread == mach_thread_self()) {
            printf("We are thread %d\n", i);
            continue;
        }
        // Find where this thread is at
        x86_thread_state32_t old_state;
        mach_msg_type_number_t state_count = x86_THREAD_STATE32_COUNT;
        err = thread_get_state(thread, x86_THREAD_STATE32, (thread_state_t)&old_state, &state_count);
        if (err != KERN_SUCCESS) {
            printf("Could not get thread info for thread %d: %s\n", i, mach_error_string(err));
            return;
        }
        // If this is one of the janky threads, kill it (gently)
        if ((old_state.__eip & ~(0xFFF)) == hack_page) {
            // Addresses are printed in hex (%x) to match the 0x prefix
            printf("Janky thread %d eip 0x%08x (start is 0x%08x)\n", i, old_state.__eip, hack_page);
            // If you don't suspend the thread before terminating it, the Finder will Find you in your sleep
            thread_suspend(thread);
            thread_terminate(thread);
        }
    }
    // Call user code
    const char *lib_path = (const char *)params->user_info;
    const char *lib_fn = lib_path + strlen(lib_path) + 1;
    printf("Loading %s\n", lib_path);
    void *lib = dlopen(lib_path, RTLD_NOW);
    printf("Loaded at %p\n", lib);
    void (*fn)(void) = (void(*)(void))dlsym(lib, lib_fn);
    printf("Running %s::%s() at %p\n", lib_path, lib_fn, fn);
    if (fn != NULL) {
        fn();
    }
}
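The page test in inj_entry comes down to masking off the low 12 bits of two addresses and comparing the results. The following is a minimal, portable sketch of just that arithmetic, assuming 4 KiB pages as the 0xFFF mask above does. The names SKETCH_PAGE_SIZE, page_base, and on_shellcode_page are made up for illustration and do not appear in the injector.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page size; matches the 0xFFF mask used in inj_entry. */
#define SKETCH_PAGE_SIZE 0x1000u

/* Round an address down to the start of its page. */
static uint32_t page_base(uint32_t addr)
{
    /* The mask trick only works when the page size is a power of two. */
    assert((SKETCH_PAGE_SIZE & (SKETCH_PAGE_SIZE - 1u)) == 0);
    return addr & ~(SKETCH_PAGE_SIZE - 1u);
}

/* Nonzero when a thread's eip lies on the page containing shellcode_addr. */
static int on_shellcode_page(uint32_t eip, uint32_t shellcode_addr)
{
    return page_base(eip) == page_base(shellcode_addr);
}
```

As in inj_entry, this only matches threads executing on the single page that contains the pointer; shellcode spanning several pages would need a range check instead.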
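Resolving Functions
Now that we have our payloads, we just need to resolve their external functions dynamically before running. Normally macOS makes this easy through the system libraries and the dyld shared cache, but that approach fails if we try to run the code from a debugger, because lldb injects its own version of libsystem_pthread.dylib into the process being debugged. So we resolve symbols manually instead. Our resolver is based on Stanislas Lejay's "Playing with Mach-O binaries and dyld", updated to read memory from the target process through its Mach task port.
First, a convenience helper for reading virtual memory from the target process: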
// Read length bytes from vmaddr in task
// Returns a malloc'd buffer that the caller must free, or NULL on failure
char *virtual_read(task_t task, mach_vm_address_t vmaddr, mach_vm_size_t length) {
    char *memory = (char *)malloc((size_t) length);
    if (memory == NULL) return NULL;
    mach_vm_offset_t output = (mach_vm_offset_t)memory;
    mach_vm_size_t outsize;
    kern_return_t ret;
    ret = mach_vm_read_overwrite(task, vmaddr, length, output, &outsize);
    if (ret != KERN_SUCCESS) {
        // Do not leak the buffer when the remote read fails
        free(memory);
        return NULL;
    }
    return memory;
}
Then, to start resolving functions, we need to find the base address of the library in the target process. Again following "Playing with Mach-O binaries and dyld":
static uint32_t find_function32(task_t task, char *base, char *shared_cache_rx_base, const char *fnname)
{
    struct mach_header *base_header = (struct mach_header *)virtual_read(task, (mach_vm_address_t)base, sizeof(struct mach_header));
    uint32_t ncmds = base_header->ncmds;
    free(base_header);
    struct symtab_command *symcmd = NULL;
    mach_vm_address_t start = (mach_vm_address_t)(base + sizeof(struct mach_header));
    // Get symtab and dysymtab
    for (uint32_t i = 0; i < ncmds; ++i) {
        struct segment_command *cmd = (struct segment_command *)virtual_read(task, start, 0x100);
        if (cmd->cmd == LC_SYMTAB) {
            // Keep this buffer alive; it is freed once the lookup is done
            symcmd = (struct symtab_command*)cmd;
            break;
        }
        start += cmd->cmdsize;
        free(cmd);
    }
    // Without an LC_SYMTAB command there is nothing to search
    if (symcmd == NULL) return 0;
    // We need to resolve where the symbol/string tables are in the target memory
    mach_vm_address_t strtab_start = 0;
    mach_vm_address_t symtab_start = 0;
    // Also need the base address of the binary (different with cache)
    uint64_t aslr_slide = 0;
    // If this library is in the shared cache then use that instead
    if (base >= shared_cache_rx_base) {
        // "Playing with Mach-O binaries and dyld", but virtual_read
        dyld_cache_header *cache_header = (dyld_cache_header *)virtual_read(task, (mach_vm_address_t)shared_cache_rx_base, sizeof(dyld_cache_header));
        size_t rx_size = 0;
        size_t rw_size = 0;
        size_t rx_addr = 0;
        size_t ro_addr = 0;
        off_t ro_off = 0;
        for (int i = 0; i < cache_header->mappingCount; ++i) {
            shared_file_mapping_np *mapping = (shared_file_mapping_np *)virtual_read(task, (mach_vm_address_t)shared_cache_rx_base + cache_header->mappingOffset + sizeof(shared_file_mapping_np) * i, sizeof(shared_file_mapping_np));
            if (mapping->init_prot & VM_PROT_EXECUTE) {
                // Get size and address of [R-X] mapping
                rx_size = (size_t)mapping->size;
                rx_addr = (size_t)mapping->address;
            } else if (mapping->init_prot & VM_PROT_WRITE) {
                // Get size of [RW-] mapping
                rw_size = (size_t)mapping->size;
            } else if (mapping->init_prot == VM_PROT_READ) {
                // Get file offset of [R--] mapping
                ro_off = (size_t)mapping->file_offset;
                ro_addr = (size_t)mapping->address;
            }
            free(mapping);
        }
        free(cache_header);
        aslr_slide = (uint64_t)shared_cache_rx_base - rx_addr;
        char *shared_cache_ro = (char*)(ro_addr + aslr_slide);
        uint64_t stroff_from_ro = symcmd->stroff - rx_size - rw_size;
        uint64_t symoff_from_ro = symcmd->symoff - rx_size - rw_size;
        strtab_start = (mach_vm_address_t)(shared_cache_ro + stroff_from_ro);
        symtab_start = (mach_vm_address_t)(shared_cache_ro + symoff_from_ro);
    } else {
        // Otherwise just use the base address of the library
        aslr_slide = (uint64_t)base;
        strtab_start = (mach_vm_address_t)base + symcmd->stroff;
        symtab_start = (mach_vm_address_t)base + symcmd->symoff;
    }
    char *strtab = (char *)virtual_read(task, strtab_start, symcmd->strsize);
    struct nlist *symtab = (struct nlist *)virtual_read(task, symtab_start, symcmd->nsyms * sizeof(struct nlist));
    for (uint32_t i = 0; i < symcmd->nsyms; ++i){
        uint32_t strtab_off = symtab[i].n_un.n_strx;
        uint32_t func = symtab[i].n_value;
        if(strcmp(&strtab[strtab_off], fnname) == 0) {
            free(strtab);
            free(symtab);
            free(symcmd);
            return (uint32_t)func + aslr_slide;
        }
    }
    free(strtab);
    free(symtab);
    free(symcmd);
    return 0;
}
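Stripped of the remote reads and the shared-cache arithmetic, the final loop of find_function32 is a linear scan of the symbol table that compares each entry's name in the string table against the requested name. The sketch below shows that scan on local buffers. struct sketch_nlist32 is a hand-written stand-in with the same field layout as the 32-bit struct nlist, so the example builds without the Mach-O headers; it, lookup_symbol, and the bounds checks are assumptions of this sketch rather than part of the original resolver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the 32-bit Mach-O struct nlist (hypothetical name). */
struct sketch_nlist32 {
    uint32_t n_strx;   /* offset of the symbol name in the string table */
    uint8_t  n_type;
    uint8_t  n_sect;
    int16_t  n_desc;
    uint32_t n_value;  /* unslid address of the symbol */
};

/* Return n_value + slide for the first symbol named `name`, or 0 if absent. */
static uint32_t lookup_symbol(const struct sketch_nlist32 *symtab, uint32_t nsyms,
                              const char *strtab, uint32_t strsize,
                              const char *name, uint32_t slide)
{
    assert(symtab != NULL && strtab != NULL && name != NULL);
    /* Require a terminated string table so strcmp cannot run off the end. */
    if (strsize == 0 || strtab[strsize - 1] != '\0') return 0;
    for (uint32_t i = 0; i < nsyms; i++) {
        uint32_t off = symtab[i].n_strx;
        if (off >= strsize) continue;  /* skip entries pointing outside the table */
        if (strcmp(&strtab[off], name) == 0) {
            return symtab[i].n_value + slide;
        }
    }
    return 0;
}
```

Returning 0 for "not found" follows find_function32; a caller should treat a 0 result as a failed lookup before patching it into shellcode.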
We can then put these pieces together to find the address of any function in the target process:
uint32_t task_dlsym32(task_t task, pid_t pid, const char *libName, const char *fnName) {
    char *shared_cache_base;
    char *lib_base = find_lib32(task, libName, &shared_cache_base);
    uint32_t fn_guest = find_function32(task, lib_base, shared_cache_base, fnName);
    return (uint32_t)fn_guest;
}
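Patching the Shellcode
Now that we have resolved the addresses, we need to patch them into the shellcode. We also need to patch in the various strings, such as library paths.
First, we choose addresses for the path of the Stage 3 library and for the parameters to inj_entry, and write both into the target: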
uint32_t injLibAddr = remoteCode + CODE_SIZE - 0x80;
uint32_t injLibParamsAddr = remoteStack + STACK_SIZE * 3 / 4;
// File path of injection library
kr = mach_vm_write(remoteTask, injLibAddr, (vm_offset_t)injectLib, strlen(injectLib) + 1);
if (kr != KERN_SUCCESS) return -2;
// Parameters: currently <user library path>\0<user library function>\0
kr = mach_vm_write(remoteTask, injLibParamsAddr, (vm_offset_t)lib, strlen(lib) + 1);
if (kr != KERN_SUCCESS) return -2;
kr = mach_vm_write(remoteTask, injLibParamsAddr + strlen(lib) + 1, (vm_offset_t)fn, strlen(fn) + 1);
if (kr != KERN_SUCCESS) return -2;
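The parameter block is just two NUL-terminated strings laid end to end, and inj_entry recovers the second one by skipping strlen of the first plus one. A local sketch of that layout, using hypothetical helpers pack_params and second_string that are not part of the injector, might look like this:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Pack "<lib>\0<fn>\0" into out. Returns the bytes used, or 0 if it does not fit. */
static size_t pack_params(char *out, size_t cap, const char *lib, const char *fn)
{
    assert(out != NULL && lib != NULL && fn != NULL);
    size_t lib_len = strlen(lib) + 1;  /* include the terminating NUL */
    size_t fn_len = strlen(fn) + 1;
    if (lib_len + fn_len > cap) return 0;
    memcpy(out, lib, lib_len);
    memcpy(out + lib_len, fn, fn_len);
    return lib_len + fn_len;
}

/* Find the second string the same way inj_entry derives lib_fn from lib_path. */
static const char *second_string(const char *packed)
{
    return packed + strlen(packed) + 1;
}
```

Packing locally first makes it easy to check that both strings fit in the space reserved for them before anything is written into the target.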
Next comes a list of remaps that pairs each placeholder string in the shellcode with the address that should replace it:
struct remap {
    const char *search;
    uint32_t replace;
    int replace_count;
};
struct remap remaps[] = {
    {"0000", (uint32_t)injLibAddr, 0},
    {"AAAA", (uint32_t)task_dlsym32(remoteTask, pid, "libsystem_pthread.dylib", "_pthread_create_from_mach_thread"), 0},
    {"BBBB", (uint32_t)remoteCode + sc1_length, 0},
    {"CCCC", (uint32_t)task_dlsym32(remoteTask, pid, "libdyld.dylib", "_dlopen"), 0},
    {"DDDD", (uint32_t)task_dlsym32(remoteTask, pid, "libdyld.dylib", "_dlsym"), 0},
    {"EEEE", (uint32_t)injLibParamsAddr, 0},
};
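Notably, this includes:
- The various external functions we need to call
- The location of the stage 2 shellcode
- The location of the string naming the injection library to be dlopen()ed
- The location of the parameters to inj_entry
Then we simply walk the shellcode, check every offset against each remap pattern, and replace the bytes where required: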
// Stop 4 bytes before the end so memcmp never reads past the shellcode
for (size_t i = 0; i + 4 <= sizeof(shellcode); i++) {
    // Start at offset 0 so a placeholder at the very beginning is not skipped
    char *possiblePatchLocation = (char *)shellcode + i;
    for (int j = 0; j < sizeof(remaps) / sizeof(*remaps); j ++) {
        if (memcmp(possiblePatchLocation, remaps[j].search, 4) == 0) {
            memcpy(possiblePatchLocation, &remaps[j].replace, 4);
            remaps[j].replace_count ++;
        }
    }
}
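The remap table carries a replace_count field that the loop increments but the post never shows being read. One plausible use, sketched here as portable C on a local buffer, is to confirm after patching that every placeholder was found at least once, since an unpatched 0x41414141 would only show up later as a crash in the target. The names remap_entry, patch_placeholders, and all_placeholders_patched are hypothetical and not taken from the injector.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the remap entry used above. */
struct remap_entry {
    const char *search;   /* 4-byte placeholder, e.g. "AAAA" */
    uint32_t replace;     /* 32-bit value written in its place */
    int replace_count;    /* how many times it was patched */
};

/* Overwrite every 4-byte placeholder in buf. Returns the number of patches. */
static int patch_placeholders(unsigned char *buf, size_t len,
                              struct remap_entry *remaps, size_t nremaps)
{
    assert(buf != NULL && remaps != NULL);
    int total = 0;
    /* Stop 4 bytes before the end so memcmp never reads past buf. */
    for (size_t i = 0; i + 4 <= len; i++) {
        for (size_t j = 0; j < nremaps; j++) {
            if (memcmp(buf + i, remaps[j].search, 4) == 0) {
                memcpy(buf + i, &remaps[j].replace, 4);
                remaps[j].replace_count++;
                total++;
                break;  /* this offset is done; move to the next one */
            }
        }
    }
    return total;
}

/* Nonzero only if every remap was applied at least once. */
static int all_placeholders_patched(const struct remap_entry *remaps, size_t nremaps)
{
    for (size_t j = 0; j < nremaps; j++) {
        if (remaps[j].replace_count == 0) return 0;
    }
    return 1;
}
```

An injector could refuse to write the shellcode into the target when all_placeholders_patched returns 0, which turns a silent mismatch between the shellcode source and the remap table into an early error.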
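Automating Shellcode Generation with SCC
While writing this injection framework, we had to test a lot of shellcode. Conveniently, scc comes with a command-line interface that can be scripted from Python with subprocess.run():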
import os
import subprocess
# Xcode provides us this environment variable with the root directory of the project
project_dir = os.environ["PROJECT_DIR"]
proc1 = subprocess.run(["scc", "--platform", "mac", "--arch", "x86", project_dir+"/testinj/shellcode.c", "--stdout"], stdout=subprocess.PIPE)
proc2 = subprocess.run(["scc", "--platform", "mac", "--arch", "x86", project_dir+"/testinj/shellcode2.c", "--stdout"], stdout=subprocess.PIPE)
# Pad shellcodes with interrupts, just in case
proc1_output = proc1.stdout
proc1_output = proc1_output.ljust((len(proc1_output) + 0xf) & ~0xf, b'\xcc')
proc2_output = proc2.stdout
proc2_output = proc2_output.ljust((len(proc2_output) + 0xf) & ~0xf, b'\xcc')
From here, we can simply format the shellcode output as a C-style header file that our injector can #include. We also define a few extra variables to help the injection code:
# Combine shellcode into a C-style array
shellcodes = proc1_output + proc2_output
formatted = "unsigned char shellcode[] = {" + ", ".join("0x{:02X}".format(b) for b in shellcodes) + "};\n"
# Page align
code_size = len(shellcodes) + 0x100
code_size = (code_size + 0xfff) & ~0xfff
# Write to a C header file for main.m to include
with open(project_dir + "/testinj/shellcode.h", "w") as f:
    f.write(formatted)
    f.write("uint32_t sc1_length = 0x{:x};\n".format(len(proc1_output)))
    f.write("#define CODE_SIZE 0x{:x}\n".format(code_size))
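The script relies on two instances of the same bit trick: (x + 0xf) & ~0xf pads each shellcode to a 16-byte boundary, and (x + 0xfff) & ~0xfff rounds the total up to a whole 4 KiB page. For readers more at home on the injector side, here is the same arithmetic as a small C sketch. round_up and code_size_for are hypothetical helper names, not something the generated header or the injector defines.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of align; align must be a power of two. */
static size_t round_up(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Mirrors the script: add 0x100 bytes of slack, then round up to a 4 KiB page. */
static size_t code_size_for(size_t shellcode_len)
{
    return round_up(shellcode_len + 0x100, 0x1000);
}
```

The 0x100 bytes of slack presumably leave room for the injection library path, which the injector writes at remoteCode + CODE_SIZE - 0x80.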
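Conclusion
Compared with the mess that is macOS, injecting into a remote process on Windows seems trivial. This process involved shellcode, broken half-threads, and about 10 hours of reading nearly nonexistent Mach documentation. In the end, we have a tool that injects into 32-bit applications on the last version of macOS that supports them.
References
Source code: GitHub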