Mach-O File Structure: Looking at the Development Process from the Result


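Personally, I think anything can be analyzed from three angles: design, process, and result.
Taking app development as an example: the design angle means standing in the position of the app's designer, analyzing requirements, breaking the app down into functional modules, and designing how those modules relate to each other. The process angle means standing in the developer's position and carrying out the actual development work. The result angle means standing in a third party's position and analyzing the compiled product.
This article is an introduction to Mach-O and mainly covers the structure of a Mach-O file.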

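Mach-O (Mach Object) is a file format for executables, object code, dynamic libraries, and core dumps. As a replacement for the a.out format, Mach-O offers better extensibility and faster access to the information in the symbol table.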

It is mainly used on macOS (formerly OS X) and iOS, and plays a role similar to the PE format on Windows or the ELF format on Linux.

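Understanding the Mach-O format helps in understanding how Xcode builds and runs programs on top of Mach-O, and in understanding lower-level debugging techniques.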

File types

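  • Executable: the binary file of an application
  • Dylib: a dynamic library, similar to a dynamic-link library (DLL) on Windows
  • Bundle: a dynamic library that cannot be linked against. It can only be loaded at runtime through dlopen(), for example a macOS plug-in
  • Image: a general term that covers the Executable, Dylib, and Bundle types
  • Framework: a dynamic library packaged together with resources and header files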

File structure

A Mach-O file consists of a Header, Load commands, and Segments. The figure below is the Mach-O structure diagram provided by Apple:

Figure: Mach-O basic structure diagram

The figure comes from Apple's osx-abi-macho-file-format-reference, which can no longer be found on Apple's website. A copy hosted on GitHub can be used as a reference.

This article uses the arm64 architecture as its subject.

The figure below shows the concrete structure of a Mach-O file:

Figure: Mach-O file preview

Header

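The Header sits at the very beginning of a Mach-O file. It describes the file type, the target architecture, the number of Load commands, their total size, and similar information.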

Defined in: #import <mach-o/loader.h>

/*
 * The 64-bit mach header appears at the very beginning of object files for
 * 64-bit architectures.
 */
struct mach_header_64 {
	uint32_t	magic;		/* mach magic number identifier */
	cpu_type_t	cputype;	/* cpu specifier */
	cpu_subtype_t	cpusubtype;	/* machine specifier */
	uint32_t	filetype;	/* type of file */
	uint32_t	ncmds;		/* number of load commands */
	uint32_t	sizeofcmds;	/* the size of all the load commands */
	uint32_t	flags;		/* flags */
	uint32_t	reserved;	/* reserved */
};

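Most fields are self-explanatory from their names. The filetype field deserves special mention; the values seen most often are MH_EXECUTE, MH_DYLIB, MH_DYLINKER, and MH_BUNDLE. The full list of possible values follows.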

Defined in: #import <mach-o/loader.h>

#define	MH_OBJECT	0x1		/* relocatable object file */
#define	MH_EXECUTE	0x2		/* demand paged executable file */
#define	MH_FVMLIB	0x3		/* fixed VM shared library file */
#define	MH_CORE		0x4		/* core file */
#define	MH_PRELOAD	0x5		/* preloaded executable file */
#define	MH_DYLIB	0x6		/* dynamically bound shared library */
#define	MH_DYLINKER	0x7		/* dynamic link editor */
#define	MH_BUNDLE	0x8		/* dynamically bound bundle file */
#define	MH_DYLIB_STUB	0x9		/* shared library stub for static */
					/*  linking only, no section contents */
#define	MH_DSYM		0xa		/* companion file with only debug */
					/*  sections */
#define	MH_KEXT_BUNDLE	0xb		/* x86_64 kexts */
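
To make the Header fields concrete, here is a minimal sketch that copies a 64-bit header out of a byte buffer and turns a few filetype values into names. The `demo_` names are hypothetical, and the struct is a local mirror of `struct mach_header_64` so that the sketch compiles without the Apple SDK; on macOS you would include <mach-o/loader.h> and use the real definitions. It assumes a thin (non-fat) 64-bit file whose byte order matches the host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct mach_header_64 (cpu_type_t and cpu_subtype_t
 * are 32-bit integers). Hypothetical name, for illustration only. */
struct demo_mach_header_64 {
	uint32_t magic;
	int32_t  cputype;
	int32_t  cpusubtype;
	uint32_t filetype;
	uint32_t ncmds;
	uint32_t sizeofcmds;
	uint32_t flags;
	uint32_t reserved;
};

#define DEMO_MH_MAGIC_64 0xfeedfacfu	/* same value as MH_MAGIC_64 */

/* Copies the header out of buf. Returns 0 on success, or -1 if the buffer
 * is too short or the magic is not the native-endian 64-bit magic
 * (for example a fat binary, a 32-bit file, or a byte-swapped file). */
int demo_read_header(const void *buf, size_t len,
		     struct demo_mach_header_64 *out)
{
	assert(buf != NULL && out != NULL);
	if (len < sizeof *out)
		return -1;
	memcpy(out, buf, sizeof *out);
	return out->magic == DEMO_MH_MAGIC_64 ? 0 : -1;
}

/* Maps the most common filetype values listed above to their macro names. */
const char *demo_filetype_name(uint32_t filetype)
{
	switch (filetype) {
	case 0x1: return "MH_OBJECT";
	case 0x2: return "MH_EXECUTE";
	case 0x6: return "MH_DYLIB";
	case 0x7: return "MH_DYLINKER";
	case 0x8: return "MH_BUNDLE";
	case 0xa: return "MH_DSYM";
	default:  return "other";
	}
}
```

In practice buf would come from reading or memory-mapping the file; the Load commands start right after the header, at offset sizeof(struct mach_header_64).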

Load commands

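In the file layout, the Load commands come immediately after the Header. Their main job is to describe the file layout and linking information. They can be thought of as an array of variable-length entries, each of which begins with a struct load_command.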

Specifically, they describe:

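  • The initial layout of the file in virtual memory
  • The location of the symbol table
  • The shared libraries referenced by the main program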

The struct load_command type

Defined in: #import <mach-o/loader.h>

/*
 * The load commands directly follow the mach_header.  The total size of all
 * of the commands is given by the sizeofcmds field in the mach_header.  All
 * load commands must have as their first two fields cmd and cmdsize.  The cmd
 * field is filled in with a constant for that command type.  Each command type
 * has a structure specifically for it.  The cmdsize field is the size in bytes
 * of the particular load command structure plus anything that follows it that
 * is a part of the load command (i.e. section structures, strings, etc.).  To
 * advance to the next load command the cmdsize can be added to the offset or
 * pointer of the current load command.  The cmdsize for 32-bit architectures
 * MUST be a multiple of 4 bytes and for 64-bit architectures MUST be a multiple
 * of 8 bytes (these are forever the maximum alignment of any load commands).
 * The padded bytes must be zero.  All tables in the object file must also
 * follow these rules so the file can be memory mapped.  Otherwise the pointers
 * to these tables will not work well or at all on some machines.  With all
 * padding zeroed like objects will compare byte for byte.
 */
struct load_command {
	uint32_t cmd;		/* type of load command */
	uint32_t cmdsize;	/* total size of command in bytes */
};

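The comment shows that cmd determines not only the type of a Load command but also its overall layout. In other words, struct load_command is only a common prefix (a kind of base struct); the concrete data structure has to be selected according to the value of cmd.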
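
The walk described in the header comment (advance by cmdsize to reach the next command) can be sketched roughly as follows. The function and struct names are hypothetical, and the struct is a local mirror of `struct load_command` so the sketch builds without the Apple SDK. The multiple-of-8 check applies to 64-bit files, as the comment above states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct load_command (hypothetical name). */
struct demo_load_command {
	uint32_t cmd;
	uint32_t cmdsize;
};

/* Counts the load commands whose cmd field equals `wanted`.
 * cmds points at the first load command (just past the header);
 * ncmds and sizeofcmds are the values stored in the header.
 * Returns -1 if the commands are malformed or overrun sizeofcmds. */
int demo_count_commands(const uint8_t *cmds, uint32_t ncmds,
			uint32_t sizeofcmds, uint32_t wanted)
{
	assert(cmds != NULL);
	size_t offset = 0;	/* invariant: offset <= sizeofcmds */
	int found = 0;

	for (uint32_t i = 0; i < ncmds; i++) {
		struct demo_load_command lc;

		if (sizeofcmds - offset < sizeof lc)
			return -1;
		/* memcpy avoids unaligned reads from the raw buffer */
		memcpy(&lc, cmds + offset, sizeof lc);

		if (lc.cmdsize < sizeof lc || lc.cmdsize % 8 != 0 ||
		    lc.cmdsize > sizeofcmds - offset)
			return -1;

		if (lc.cmd == wanted)
			found++;
		offset += lc.cmdsize;	/* step to the next command */
	}
	return found;
}
```

For instance, passing 0x19 (the value of LC_SEGMENT_64 in loader.h) as `wanted` counts the 64-bit segment commands. A real tool would switch on lc.cmd inside the loop and reinterpret the bytes as the matching structure.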

For example, the commonly used LC_SEGMENT_64 type (the __TEXT, __DATA, and __LINKEDIT segments discussed below are all of this type) has the following data structure:

Defined in: #import <mach-o/loader.h>

/*
 * The 64-bit segment load command indicates that a part of this file is to be
 * mapped into a 64-bit task's address space.  If the 64-bit segment has
 * sections then section_64 structures directly follow the 64-bit segment
 * command and their size is reflected in cmdsize.
 */
struct segment_command_64 { /* for 64-bit architectures */
	uint32_t	cmd;		/* LC_SEGMENT_64 */
	uint32_t	cmdsize;	/* includes sizeof section_64 structs */
	char		segname[16];	/* segment name */
	uint64_t	vmaddr;		/* memory address of this segment */
	uint64_t	vmsize;		/* memory size of this segment */
	uint64_t	fileoff;	/* file offset of this segment */
	uint64_t	filesize;	/* amount to map from the file */
	vm_prot_t	maxprot;	/* maximum VM protection */
	vm_prot_t	initprot;	/* initial VM protection */
	uint32_t	nsects;		/* number of sections in segment */
	uint32_t	flags;		/* flags */
};

Information about the other cmd types can be found in loader.h; look for the macros that begin with LC_.

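LC_SEGMENT_64 needs special attention: it can be followed by a number of struct section_64 entries that describe the Sections the segment contains. In that case cmdsize covers not only the size of struct segment_command_64 but also the size of those Section entries, and the nsects field gives the exact number of Sections.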

The data structure of a Section is:

Defined in: #import <mach-o/loader.h>

/*
 * A segment is made up of zero or more sections.  Non-MH_OBJECT files have
 * all of their segments with the proper sections in each, and padded to the
 * specified segment alignment when produced by the link editor.  The first
 * segment of a MH_EXECUTE and MH_FVMLIB format file contains the mach_header
 * and load commands of the object file before its first section.  The zero
 * fill sections are always last in their segment (in all formats).  This
 * allows the zeroed segment padding to be mapped into memory where zero fill
 * sections might be. The gigabyte zero fill sections, those with the section
 * type S_GB_ZEROFILL, can only be in a segment with sections of this type.
 * These segments are then placed after all other segments.
 *
 * The MH_OBJECT format has all of its sections in one segment for
 * compactness.  There is no padding to a specified segment boundary and the
 * mach_header and load commands are not part of the segment.
 *
 * Sections with the same section name, sectname, going into the same segment,
 * segname, are combined by the link editor.  The resulting section is aligned
 * to the maximum alignment of the combined sections and is the new section's
 * alignment.  The combined sections are aligned to their original alignment in
 * the combined section.  Any padded bytes to get the specified alignment are
 * zeroed.
 *
 * The format of the relocation entries referenced by the reloff and nreloc
 * fields of the section structure for mach object files is described in the
 * header file <reloc.h>.
 */
struct section_64 { /* for 64-bit architectures */
	char		sectname[16];	/* name of this section */
	char		segname[16];	/* segment this section goes in */
	uint64_t	addr;		/* memory address of this section */
	uint64_t	size;		/* size in bytes of this section */
	uint32_t	offset;		/* file offset of this section */
	uint32_t	align;		/* section alignment (power of 2) */
	uint32_t	reloff;		/* file offset of relocation entries */
	uint32_t	nreloc;		/* number of relocation entries */
	uint32_t	flags;		/* flags (section type and attributes)*/
	uint32_t	reserved1;	/* reserved (for offset or index) */
	uint32_t	reserved2;	/* reserved (for count or sizeof) */
	uint32_t	reserved3;	/* reserved */
};
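
The relationship between an LC_SEGMENT_64 command, its nsects field, and the section_64 entries that follow it can be illustrated with a small lookup routine. This is only a sketch: the `demo_` structs are local mirrors of `struct segment_command_64` and `struct section_64` (vm_prot_t is a 32-bit int), and the function name and signature are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirrors of the loader.h structs (hypothetical names). */
struct demo_segment_command_64 {
	uint32_t cmd, cmdsize;
	char     segname[16];
	uint64_t vmaddr, vmsize, fileoff, filesize;
	int32_t  maxprot, initprot;
	uint32_t nsects, flags;
};

struct demo_section_64 {
	char     sectname[16];
	char     segname[16];
	uint64_t addr, size;
	uint32_t offset, align, reloff, nreloc, flags;
	uint32_t reserved1, reserved2, reserved3;
};

/* Looks up a section by name inside one LC_SEGMENT_64 command.
 * seg_bytes/len describe the raw bytes of that command.
 * Returns 0 and fills *out on success, -1 otherwise. */
int demo_find_section(const uint8_t *seg_bytes, size_t len,
		      const char *sectname, struct demo_section_64 *out)
{
	assert(seg_bytes != NULL && sectname != NULL && out != NULL);
	struct demo_segment_command_64 seg;

	if (len < sizeof seg)
		return -1;
	memcpy(&seg, seg_bytes, sizeof seg);

	/* cmdsize must cover the command plus nsects section entries */
	uint64_t needed = sizeof seg + (uint64_t)seg.nsects * sizeof *out;
	if (seg.cmdsize < needed || len < needed)
		return -1;

	for (uint32_t i = 0; i < seg.nsects; i++) {
		memcpy(out, seg_bytes + sizeof seg + (size_t)i * sizeof *out,
		       sizeof *out);
		/* sectname may use all 16 bytes without a terminating NUL */
		if (strncmp(out->sectname, sectname, 16) == 0)
			return 0;
	}
	return -1;
}
```

The section entries are read at a fixed stride directly after the segment command, which is why nsects and cmdsize together are enough to locate every one of them.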

Segment

Apart from the Header and Load commands described above, the rest of a Mach-O file is Segment data.

Segments must be aligned to a multiple of the memory page size, mainly because of how virtual memory is used.
  • arm64: 16KB
  • Other platforms: 4KB
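
As a small illustration of page alignment, the following sketch rounds a size up to the next page boundary. The function name is hypothetical; the page sizes used in the usage note are the 16KB (0x4000) and 4KB (0x1000) figures given above.

```c
#include <assert.h>
#include <stdint.h>

/* Rounds size up to the next multiple of page_size.
 * page_size must be a non-zero power of two (e.g. 0x4000 or 0x1000). */
uint64_t demo_round_up_to_page(uint64_t size, uint64_t page_size)
{
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return (size + page_size - 1) & ~(page_size - 1);
}
```

With a 16KB page, a segment holding 0x4001 bytes of content would occupy 0x8000 bytes once rounded up, while the same content with a 4KB page would occupy 0x5000 bytes.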

Common Segments

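  • __TEXT: the code segment. It holds the program's executable code and read-only data, and is mapped readable and executable but not writable.
  • __DATA: the data segment. It holds global variables and static variables, and is mapped readable and writable.
  • __LINKEDIT: holds the information used for loading and linking, including the symbol table, the indirect symbol table, and the string table.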
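
The permissions of a segment are stored in the initprot and maxprot fields of its segment_command_64. The sketch below renders such a value as an "rwx" style string. The function name is hypothetical, and the bit values are local copies of VM_PROT_READ, VM_PROT_WRITE, and VM_PROT_EXECUTE from <mach/vm_prot.h> so the code builds without the Apple SDK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local copies of the Mach protection bits. */
#define DEMO_VM_PROT_READ    0x1
#define DEMO_VM_PROT_WRITE   0x2
#define DEMO_VM_PROT_EXECUTE 0x4

/* Writes a NUL-terminated three-character string such as "r-x" into out. */
void demo_prot_string(int32_t prot, char out[4])
{
	assert(out != NULL);
	out[0] = (prot & DEMO_VM_PROT_READ)    ? 'r' : '-';
	out[1] = (prot & DEMO_VM_PROT_WRITE)   ? 'w' : '-';
	out[2] = (prot & DEMO_VM_PROT_EXECUTE) ? 'x' : '-';
	out[3] = '\0';
}
```

For a typical executable, __TEXT usually shows up as r-x, __DATA as rw-, and __LINKEDIT as r--, though the exact values depend on how the binary was linked.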
