iOS: All About Compilation

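I believe there are two directions for learning technology: one is to keep moving forward, following the frontier and its trends; the other is to keep digging down, understanding general-purpose low-level technology and design ideas. Neither is better than the other; they work best when used in alternation.

Lately I have been reading a lot about compilers. Working through these fundamentals has felt like a breakthrough where everything suddenly connects, so it is worth organizing what I learned. I hope this article helps you too.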

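The Objective-C Compilation Process

Before talking about compilation, let's define a few concepts:

  • LLVM: originally short for Low Level Virtual Machine. The project was started by Chris Lattner (who later created Swift) and is used to compile Objective-C and Swift. Over time it gained many more capabilities and can now be used for conventional compilers, JIT compilers, debuggers, static analysis tools, and more. In short, LLVM is a collection of modular, reusable compiler and toolchain technologies.
  • Clang: a subproject of LLVM that compiles C, C++, and Objective-C quickly; it is often quoted as compiling up to about 3 times faster than GCC. Clang can be viewed as the compiler front end for Objective-C and LLVM as the back end: the front end calls into the back end to finish the job. Swift has its own front end (which includes the SIL optimizer), and its back end is also LLVM.
  • AST: abstract syntax tree, a tree of nodes arranged hierarchically.
  • IR: intermediate representation. It is independent of the source language, and its overall structure is Module (one file) -- Function -- Basic Block -- Instruction.
  • Compiler: a compiler translates code into machine code, which runs directly on the CPU. The benefit is high runtime efficiency; the drawback is a long debugging cycle, because every change requires a recompile (after editing Objective-C code you have to rebuild and rerun).
  • Interpreter: an interpreter executes code at run time, translating it piece by piece into target code and then executing that target code. The benefit is dynamism and immediate feedback while debugging (similar to Flutter or Playground); the drawback is lower execution efficiency. Using an interpreter while debugging day to day improves productivity.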
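
The Module -- Function -- Basic Block -- Instruction nesting described for IR can be pictured with a few C structs. This is only a hypothetical sketch: the type names (`Module`, `Function`, `BasicBlock`, `Instruction`) and the helper `count_instructions` are invented for illustration and are not LLVM's actual classes.

```c
#include <stddef.h>

/* Hypothetical, simplified model of the IR hierarchy (not LLVM's real API). */
typedef struct {
    const char *opcode;          /* e.g. "alloca", "store", "add", "ret" */
} Instruction;

typedef struct {
    const char *label;           /* name of the basic block */
    const Instruction *insts;    /* straight-line run of instructions */
    size_t inst_count;
} BasicBlock;

typedef struct {
    const char *name;            /* e.g. "main" */
    const BasicBlock *blocks;
    size_t block_count;
} Function;

typedef struct {
    const char *source_filename; /* one module per source file */
    const Function *functions;
    size_t function_count;
} Module;

/* Walk the hierarchy top-down and count every instruction in the module. */
size_t count_instructions(const Module *m) {
    size_t total = 0;
    for (size_t f = 0; f < m->function_count; f++) {
        const Function *fn = &m->functions[f];
        for (size_t b = 0; b < fn->block_count; b++) {
            total += fn->blocks[b].inst_count;
        }
    }
    return total;
}
```

Any tool that processes IR ends up doing some version of this nested walk: modules contain functions, functions contain basic blocks, and basic blocks contain instructions.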

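Why Is Recompilation Needed?

Let's start with a question: why does Objective-C code have to be recompiled after every change before it can run on a device, while something like JavaScript can be debugged dynamically without recompiling?

The answer is that Objective-C uses a compiler to produce machine code, whereas JavaScript is run by an interpreter. Apple wants code on the iPhone to execute as efficiently and as fast as possible, so it chose to sacrifice the debugging cycle and give up that dynamism.

So is dynamic debugging really impossible? Not quite. Objective-C supports loading dynamic libraries, which leaves an opening for it; see Injection for a concrete example. Injection works by packaging code into a dynamic library and, whenever it detects that the code has changed, loading the dynamic library again with dlopen. I won't go into the details here; take a look yourself if you are interested.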
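
The dlopen-based reloading idea can be sketched in plain C. This is a minimal illustration under my own assumptions, not Injection's actual implementation: the helper name `reload_symbol` is hypothetical, and a real tool has to do much more than look up one symbol.

```c
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical helper: (re)load a dynamic library and look up one symbol.
 * On success *handle holds the new library handle and the symbol's address
 * is returned; if loading fails, NULL is returned and *handle is NULL. */
void *reload_symbol(const char *dylib_path, const char *symbol, void **handle) {
    /* Drop the previously loaded copy, if there is one. */
    if (*handle != NULL) {
        dlclose(*handle);
        *handle = NULL;
    }
    /* RTLD_NOW resolves all symbols immediately, so failures surface here. */
    *handle = dlopen(dylib_path, RTLD_NOW);
    if (*handle == NULL) {
        return NULL;
    }
    return dlsym(*handle, symbol);
}
```

A caller would keep one `void *handle` initialized to NULL, call `reload_symbol` each time the library is rebuilt, and cast the returned address to the expected function pointer type before calling it.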

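Compilation Steps

Next, let's go over the compilation process, which I'll simplify into the following 3 steps:

  1. Preprocessing: when compilation starts, Clang preprocesses your code, for example expanding macros and importing header files;
  2. Compilation: after preprocessing, Clang performs lexical analysis and syntax analysis on the code and generates an AST. The AST is an abstract syntax tree; it is more compact than the source text and faster to traverse, so it allows faster static checking and faster IR generation.
  3. Generation: finally, the AST is lowered to IR. IR is closer to machine code but independent of the platform, so the same IR can be turned into machine code for several different platforms. On iOS, the executable ultimately produced from the IR is a Mach-O file.

Let's walk through these 3 steps in detail using code.

Compilation Steps in Detail

Preprocessing

Create a new project and write the following code in main.m: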

#import <Foundation/Foundation.h>
#define  DefineEight 8
int main(int argc, char * argv[]) {
    @autoreleasepool {
        int i = DefineEight;
        int j = 6;
        NSString *string = [[NSString alloc] initWithUTF8String:"clang"];
        int rank = i + j;
        NSLog(@"%@ rank %d",string, rank);
    }
    return 0;
}

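The preprocessing stage expands macros, imports header files, and so on. Let's see what preprocessing actually does to the main.m above. Enter the following on the command line: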

clang -E main.m

After it runs, the console prints:

# 1 "/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/System/Library/Frameworks/Foundation.framework/Headers/FoundationLegacySwiftCompatibility.h" 1 3
# 193 "/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/System/Library/Frameworks/Foundation.framework/Headers/Foundation.h" 2 3
# 10 "main.m" 2

int main(int argc, char * argv[]) {
    @autoreleasepool {
        int i = 8;
        int j = 6;
        NSString *string = [[NSString alloc] initWithUTF8String:"clang"];
        int rank = i + j;
        NSLog(@"%@ rank %d",string, rank);
    }
    return 0;
}

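We can see that the macro has been replaced by its value, and that the header import has been expanded, leaving line markers that record the paths of the files involved.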
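
Because macro expansion is plain text substitution that happens before the compiler proper sees the code, it can produce surprises. The following small sketch shows this; the macros `NAIVE_DOUBLE` and `SAFE_DOUBLE` and both functions are hypothetical names added for illustration, while `DefineEight` comes from the example above.

```c
#define DefineEight 8

/* Expands to the raw text "x + x", with no parentheses added. */
#define NAIVE_DOUBLE(x) x + x

/* Parenthesizing the arguments and the whole body keeps the expansion
 * together as a single operand. */
#define SAFE_DOUBLE(x) ((x) + (x))

int naive_times_three(void) {
    /* After preprocessing: 3 * 8 + 8, so the multiplication binds first. */
    return 3 * NAIVE_DOUBLE(DefineEight);
}

int safe_times_three(void) {
    /* After preprocessing: 3 * ((8) + (8)). */
    return 3 * SAFE_DOUBLE(DefineEight);
}
```

Running `clang -E` on a file like this shows the expanded text directly, which is often the quickest way to debug a misbehaving macro.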

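Compilation

Lexical Analysis

Next comes the compilation stage. Clang first performs lexical analysis. Enter the following on the command line: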

clang -fmodules -fsyntax-only -Xclang -dump-tokens main.m

clang -fmodules -E -Xclang -dump-tokens main.m

The output is as follows:

annot_module_include '#import <Foundation/Foundation.h>
#define  DefineEight 8

int main(int argc, char * argv[]) {

    @autoreleasepool {
        int i = DefineEight;
        int'		Loc=<main.m:9:1>
int 'int'	 [StartOfLine]	Loc=<main.m:12:1>
identifier 'main'	 [LeadingSpace]	Loc=<main.m:12:5>
l_paren '('		Loc=<main.m:12:9>
int 'int'		Loc=<main.m:12:10>
identifier 'argc'	 [LeadingSpace]	Loc=<main.m:12:14>
comma ','		Loc=<main.m:12:18>
char 'char'	 [LeadingSpace]	Loc=<main.m:12:20>
star '*'	 [LeadingSpace]	Loc=<main.m:12:25>
identifier 'argv'	 [LeadingSpace]	Loc=<main.m:12:27>
l_square '['		Loc=<main.m:12:31>
r_square ']'		Loc=<main.m:12:32>
r_paren ')'		Loc=<main.m:12:33>
l_brace '{'	 [LeadingSpace]	Loc=<main.m:12:35>
at '@'	 [StartOfLine] [LeadingSpace]	Loc=<main.m:14:5>
identifier 'autoreleasepool'		Loc=<main.m:14:6>
l_brace '{'	 [LeadingSpace]	Loc=<main.m:14:22>
int 'int'	 [StartOfLine] [LeadingSpace]	Loc=<main.m:15:9>
identifier 'i'	 [LeadingSpace]	Loc=<main.m:15:13>
equal '='	 [LeadingSpace]	Loc=<main.m:15:15>
numeric_constant '8'	 [LeadingSpace]	Loc=<main.m:15:17 <Spelling=main.m:10:22>>
semi ';'		Loc=<main.m:15:28>
int 'int'	 [StartOfLine] [LeadingSpace]	Loc=<main.m:16:9>
identifier 'j'	 [LeadingSpace]	Loc=<main.m:16:13>
equal '='	 [LeadingSpace]	Loc=<main.m:16:15>
numeric_constant '6'	 [LeadingSpace]	Loc=<main.m:16:17>
semi ';'		Loc=<main.m:16:18>
identifier 'NSString'	 [StartOfLine] [LeadingSpace]	Loc=<main.m:17:9>
star '*'	 [LeadingSpace]	Loc=<main.m:17:18>
identifier 'string'		Loc=<main.m:17:19>
equal '='	 [LeadingSpace]	Loc=<main.m:17:26>
l_square '['	 [LeadingSpace]	Loc=<main.m:17:28>
l_square '['		Loc=<main.m:17:29>
identifier 'NSString'		Loc=<main.m:17:30>
identifier 'alloc'	 [LeadingSpace]	Loc=<main.m:17:39>
r_square ']'		Loc=<main.m:17:44>
identifier 'initWithUTF8String'	 [LeadingSpace]	Loc=<main.m:17:46>
colon ':'		Loc=<main.m:17:64>
string_literal '"clang"'		Loc=<main.m:17:65>
r_square ']'		Loc=<main.m:17:72>
semi ';'		Loc=<main.m:17:73>
int 'int'	 [StartOfLine] [LeadingSpace]	Loc=<main.m:18:9>
identifier 'rank'	 [LeadingSpace]	Loc=<main.m:18:13>
equal '='	 [LeadingSpace]	Loc=<main.m:18:18>
identifier 'i'	 [LeadingSpace]	Loc=<main.m:18:20>
plus '+'	 [LeadingSpace]	Loc=<main.m:18:22>
identifier 'j'	 [LeadingSpace]	Loc=<main.m:18:24>
semi ';'		Loc=<main.m:18:25>
identifier 'NSLog'	 [StartOfLine] [LeadingSpace]	Loc=<main.m:19:9>
l_paren '('		Loc=<main.m:19:14>
at '@'		Loc=<main.m:19:15>
string_literal '"%@ rank %d"'		Loc=<main.m:19:16>
comma ','		Loc=<main.m:19:28>
identifier 'string'		Loc=<main.m:19:29>
comma ','		Loc=<main.m:19:35>
identifier 'rank'	 [LeadingSpace]	Loc=<main.m:19:37>
r_paren ')'		Loc=<main.m:19:41>
semi ';'		Loc=<main.m:19:42>
r_brace '}'	 [StartOfLine] [LeadingSpace]	Loc=<main.m:20:5>
return 'return'	 [StartOfLine] [LeadingSpace]	Loc=<main.m:21:5>
numeric_constant '0'	 [LeadingSpace]	Loc=<main.m:21:12>
semi ';'		Loc=<main.m:21:13>
r_brace '}'	 [StartOfLine]	Loc=<main.m:22:1>
eof ''		Loc=<main.m:22:2>

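During lexical analysis, Clang splits the code into individual tokens, such as parentheses, braces, equals signs, and strings. The output above shows every token along with its kind, its value, and its location. The token kinds defined by Clang fall roughly into the following 4 categories:

  1. Keywords: keywords of the language, such as if, else, while, and for;
  2. Identifiers: names such as variable names;
  3. Literals: values, numbers, and strings;
  4. Special symbols: operators such as +, -, *, and /.

The full list of token kinds can be found in the Clang source.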
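
To make the four categories concrete, here is a toy lexer in C. It is a hedged sketch, not Clang's lexer: the names `TokenKind`, `Token`, `is_keyword`, and `next_token` are hypothetical, the keyword list is a tiny sample, and every punctuator is treated as a single character.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    TOK_KEYWORD, TOK_IDENTIFIER, TOK_LITERAL, TOK_PUNCTUATOR, TOK_EOF
} TokenKind;

typedef struct {
    TokenKind kind;
    const char *start;  /* points into the source buffer */
    size_t length;      /* number of characters in the token */
} Token;

/* Tiny sample keyword table; a real lexer knows many more. */
static int is_keyword(const char *s, size_t n) {
    static const char *const keywords[] = {
        "int", "char", "return", "if", "else", "while", "for"
    };
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++) {
        if (strlen(keywords[i]) == n && strncmp(keywords[i], s, n) == 0) {
            return 1;
        }
    }
    return 0;
}

/* Read one token starting at *cursor and advance *cursor past it. */
Token next_token(const char **cursor) {
    const char *p = *cursor;
    while (isspace((unsigned char)*p)) p++;       /* skip whitespace */

    Token tok = {TOK_EOF, p, 0};
    if (*p == '\0') {
        *cursor = p;
        return tok;
    }
    if (isalpha((unsigned char)*p) || *p == '_') {
        /* identifier or keyword */
        while (isalnum((unsigned char)*p) || *p == '_') p++;
        tok.length = (size_t)(p - tok.start);
        tok.kind = is_keyword(tok.start, tok.length) ? TOK_KEYWORD
                                                     : TOK_IDENTIFIER;
    } else if (isdigit((unsigned char)*p)) {
        /* numeric literal (decimal digits only) */
        while (isdigit((unsigned char)*p)) p++;
        tok.kind = TOK_LITERAL;
    } else if (*p == '"') {
        /* string literal, without escape handling */
        p++;
        while (*p != '\0' && *p != '"') p++;
        if (*p == '"') p++;
        tok.kind = TOK_LITERAL;
    } else {
        /* everything else is treated as a one-character punctuator */
        p++;
        tok.kind = TOK_PUNCTUATOR;
    }
    tok.length = (size_t)(p - tok.start);
    *cursor = p;
    return tok;
}
```

Calling `next_token` repeatedly on a string such as `int i = 8;` yields a keyword, an identifier, a punctuator, a literal, and another punctuator before reaching `TOK_EOF`.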

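Syntax Analysis

Once lexical analysis is done, syntax analysis follows. It verifies that the syntax is correct: the tokens are first combined into nodes according to the grammar, and those nodes are then arranged hierarchically into an abstract syntax tree (AST). Enter the following on the command line: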

clang -fmodules -fsyntax-only -Xclang -ast-dump main.m

The output is as follows:

TranslationUnitDecl 0x7ff3e4015608 <<invalid sloc>> <invalid sloc> <undeserialized declarations>
|-TypedefDecl 0x7ff3e4015ea0 <<invalid sloc>> <invalid sloc> implicit __int128_t '__int128'
| `-BuiltinType 0x7ff3e4015ba0 '__int128'
|-TypedefDecl 0x7ff3e4015f10 <<invalid sloc>> <invalid sloc> implicit __uint128_t 'unsigned __int128'
| `-BuiltinType 0x7ff3e4015bc0 'unsigned __int128'
|-TypedefDecl 0x7ff3e4015fb0 <<invalid sloc>> <invalid sloc> implicit SEL 'SEL *'
| `-PointerType 0x7ff3e4015f70 'SEL *' imported
|   `-BuiltinType 0x7ff3e4015e00 'SEL'
|-TypedefDecl 0x7ff3e4016098 <<invalid sloc>> <invalid sloc> implicit id 'id'
| `-ObjCObjectPointerType 0x7ff3e4016040 'id' imported
|   `-ObjCObjectType 0x7ff3e4016010 'id' imported
|-TypedefDecl 0x7ff3e4016178 <<invalid sloc>> <invalid sloc> implicit Class 'Class'
| `-ObjCObjectPointerType 0x7ff3e4016120 'Class' imported
|   `-ObjCObjectType 0x7ff3e40160f0 'Class' imported
|-ObjCInterfaceDecl 0x7ff3e40161d0 <<invalid sloc>> <invalid sloc> implicit Protocol
|-TypedefDecl 0x7ff3e4016548 <<invalid sloc>> <invalid sloc> implicit __NSConstantString 'struct __NSConstantString_tag'
| `-RecordType 0x7ff3e4016340 'struct __NSConstantString_tag'
|   `-Record 0x7ff3e40162a0 '__NSConstantString_tag'
|-TypedefDecl 0x7ff3e4052c00 <<invalid sloc>> <invalid sloc> implicit __builtin_ms_va_list 'char *'
| `-PointerType 0x7ff3e40165a0 'char *'
|   `-BuiltinType 0x7ff3e40156a0 'char'
|-TypedefDecl 0x7ff3e4052ee8 <<invalid sloc>> <invalid sloc> implicit __builtin_va_list 'struct __va_list_tag [1]'
| `-ConstantArrayType 0x7ff3e4052e90 'struct __va_list_tag [1]' 1 
|   `-RecordType 0x7ff3e4052cf0 'struct __va_list_tag'
|     `-Record 0x7ff3e4052c58 '__va_list_tag'
|-ImportDecl 0x7ff3e42ce778 <main.m:9:1> col:1 implicit Foundation
`-FunctionDecl 0x7ff3e42cea40 <line:12:1, line:22:1> line:12:5 main 'int (int, char **)'
  |-ParmVarDecl 0x7ff3e42ce7d0 <col:10, col:14> col:14 argc 'int'
  |-ParmVarDecl 0x7ff3e42ce8f0 <col:20, col:32> col:27 argv 'char **':'char **'
  `-CompoundStmt 0x7ff3e390be60 <col:35, line:22:1>
    |-ObjCAutoreleasePoolStmt 0x7ff3e390be18 <line:14:5, line:20:5>
    | `-CompoundStmt 0x7ff3e390bde0 <line:14:22, line:20:5>
    |   |-DeclStmt 0x7ff3e42cebf0 <line:15:9, col:28>
    |   | `-VarDecl 0x7ff3e42ceb68 <col:9, line:10:22> line:15:13 used i 'int' cinit
    |   |   `-IntegerLiteral 0x7ff3e42cebd0 <line:10:22> 'int' 8
    |   |-DeclStmt 0x7ff3e5023750 <line:16:9, col:18>
    |   | `-VarDecl 0x7ff3e42cec20 <col:9, col:17> col:13 used j 'int' cinit
    |   |   `-IntegerLiteral 0x7ff3e42cec88 <col:17> 'int' 6
    |   |-DeclStmt 0x7ff3e390b5e8 <line:17:9, col:73>
    |   | `-VarDecl 0x7ff3e50237b0 <col:9, col:72> col:19 used string 'NSString *' cinit
    |   |   `-ObjCMessageExpr 0x7ff3e3820390 <col:28, col:72> 'NSString * _Nullable':'NSString *' selector=initWithUTF8String:
    |   |     |-ObjCMessageExpr 0x7ff3e5023b98 <col:29, col:44> 'NSString *' selector=alloc class='NSString'
    |   |     `-ImplicitCastExpr 0x7ff3e3820378 <col:65> 'const char * _Nonnull':'const char *' <NoOp>
    |   |       `-ImplicitCastExpr 0x7ff3e3820360 <col:65> 'char *' <ArrayToPointerDecay>
    |   |         `-StringLiteral 0x7ff3e5023c08 <col:65> 'char [6]' lvalue "clang"
    |   |-DeclStmt 0x7ff3e390bbc8 <line:18:9, col:25>
    |   | `-VarDecl 0x7ff3e390b618 <col:9, col:24> col:13 used rank 'int' cinit
    |   |   `-BinaryOperator 0x7ff3e390b720 <col:20, col:24> 'int' '+'
    |   |     |-ImplicitCastExpr 0x7ff3e390b6f0 <col:20> 'int' <LValueToRValue>
    |   |     | `-DeclRefExpr 0x7ff3e390b680 <col:20> 'int' lvalue Var 0x7ff3e42ceb68 'i' 'int'
    |   |     `-ImplicitCastExpr 0x7ff3e390b708 <col:24> 'int' <LValueToRValue>
    |   |       `-DeclRefExpr 0x7ff3e390b6b8 <col:24> 'int' lvalue Var 0x7ff3e42cec20 'j' 'int'
    |   `-CallExpr 0x7ff3e390bd60 <line:19:9, col:41> 'void'
    |     |-ImplicitCastExpr 0x7ff3e390bd48 <col:9> 'void (*)(id, ...)' <FunctionToPointerDecay>
    |     | `-DeclRefExpr 0x7ff3e390bbe0 <col:9> 'void (id, ...)' Function 0x7ff3e390b748 'NSLog' 'void (id, ...)'
    |     |-ImplicitCastExpr 0x7ff3e390bd98 <col:15, col:16> 'id':'id' <BitCast>
    |     | `-ObjCStringLiteral 0x7ff3e390bc60 <col:15, col:16> 'NSString *'
    |     |   `-StringLiteral 0x7ff3e390bc38 <col:16> 'char [11]' lvalue "%@ rank %d"
    |     |-ImplicitCastExpr 0x7ff3e390bdb0 <col:29> 'NSString *' <LValueToRValue>
    |     | `-DeclRefExpr 0x7ff3e390bc80 <col:29> 'NSString *' lvalue Var 0x7ff3e50237b0 'string' 'NSString *'
    |     `-ImplicitCastExpr 0x7ff3e390bdc8 <col:37> 'int' <LValueToRValue>
    |       `-DeclRefExpr 0x7ff3e390bcb8 <col:37> 'int' lvalue Var 0x7ff3e390b618 'rank' 'int'
    `-ReturnStmt 0x7ff3e390be50 <line:21:5, col:12>
      `-IntegerLiteral 0x7ff3e390be30 <col:12> 'int' 0

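Here TranslationUnitDecl is the root node and represents one translation unit; Decl denotes a declaration; Expr denotes an expression; Literal denotes a literal, which is a special kind of Expr; Stmt denotes a statement.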
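
The BinaryOperator node for `i + j` in the dump can be imitated with a very small tree type. This is a hedged sketch: `Node`, `NodeKind`, and `evaluate` are hypothetical names, and the leaves are plain integer literals rather than the DeclRefExpr and ImplicitCastExpr nodes Clang actually produces.

```c
#include <stddef.h>

typedef enum { NODE_INTEGER_LITERAL, NODE_BINARY_OPERATOR } NodeKind;

typedef struct Node {
    NodeKind kind;
    int value;                /* used when kind == NODE_INTEGER_LITERAL */
    char op;                  /* used when kind == NODE_BINARY_OPERATOR */
    const struct Node *lhs;   /* left child of a binary operator */
    const struct Node *rhs;   /* right child of a binary operator */
} Node;

/* Evaluate the tree bottom-up: children first, then the operator. */
int evaluate(const Node *n) {
    if (n->kind == NODE_INTEGER_LITERAL) {
        return n->value;
    }
    int left = evaluate(n->lhs);
    int right = evaluate(n->rhs);
    switch (n->op) {
        case '+': return left + right;
        case '-': return left - right;
        case '*': return left * right;
        default:  return 0;   /* unknown operator in this toy model */
    }
}
```

The hierarchy is what makes this work: an operator node only needs the results of its children, which is also why tree-shaped code is convenient for checks and for IR generation.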

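Clang Static Analyzer

One tool deserves special attention at this stage: the Clang Static Analyzer. It is a static code analysis tool for finding bugs in C, C++, and Objective-C programs. It consists of two parts, the analyzer core and the checkers. Every checker is built on the underlying analyzer core, and you can write your own checkers using the facilities the analyzer core provides. For each statement it processes, the analyzer core goes through the callbacks of all the checkers, so the more checkers there are, the slower the analysis runs. You can list the checkers available in your current Clang version from the command line: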

clang -cc1 -analyzer-checker-help

It prints:

OVERVIEW: Clang Static Analyzer Checkers List

USAGE: -analyzer-checker <CHECKER or PACKAGE,...>

CHECKERS:
  core.CallAndMessage           Check for logical errors for function calls and Objective-C message expressions (e.g., uninitialized arguments, null function pointers)
  core.DivideZero               Check for division by zero
  core.DynamicTypePropagation   Generate dynamic type information
  core.NonNullParamChecker      Check for null pointers passed as arguments to a function whose arguments are references or marked with the 'nonnull' attribute
  core.NullDereference          Check for dereferences of null pointers
  core.StackAddressEscape       Check that addresses to stack memory do not escape the function
  core.UndefinedBinaryOperatorResult
                                Check for undefined results of binary operators
  core.VLASize                  Check for declarations of VLA of undefined or zero size
  core.uninitialized.ArraySubscript
                                Check for uninitialized values used as array subscripts
  core.uninitialized.Assign     Check for assigning uninitialized values
  core.uninitialized.Branch     Check for uninitialized values used as branch conditions
  core.uninitialized.CapturedBlockVariable
                                Check for blocks that capture uninitialized values
  core.uninitialized.UndefReturn Check for uninitialized values being returned to the caller
  cplusplus.InnerPointer        Check for inner pointers of C++ containers used after re/deallocation
  cplusplus.Move                Find use-after-move bugs in C++
  cplusplus.NewDelete           Check for double-free and use-after-free problems. Traces memory managed by new/delete.
  cplusplus.NewDeleteLeaks      Check for memory leaks. Traces memory managed by new/delete.
  cplusplus.PureVirtualCall     Check pure virtual function calls during construction/destruction
  deadcode.DeadStores           Check for values stored to variables that are never read afterwards
  nullability.NullPassedToNonnull
                                Warns when a null pointer is passed to a pointer which has a _Nonnull type.
  nullability.NullReturnedFromNonnull
                                Warns when a null pointer is returned from a function that has _Nonnull return type.
  nullability.NullableDereferenced
                                Warns when a nullable pointer is dereferenced.
  nullability.NullablePassedToNonnull
                                Warns when a nullable pointer is passed to a pointer which has a _Nonnull type.
  nullability.NullableReturnedFromNonnull
                                Warns when a nullable pointer is returned from a function that has _Nonnull return type.
  optin.cplusplus.UninitializedObject
                                Reports uninitialized fields after object construction
  optin.cplusplus.VirtualCall   Check virtual function calls during construction/destruction
  optin.mpi.MPI-Checker         Checks MPI code
  optin.osx.OSObjectCStyleCast  Checker for C-style casts of OSObjects
  optin.osx.cocoa.localizability.EmptyLocalizationContextChecker
                                Check that NSLocalizedString macros include a comment for context
  optin.osx.cocoa.localizability.NonLocalizedStringChecker
                                Warns about uses of non-localized NSStrings passed to UI methods expecting localized NSStrings
  optin.performance.GCDAntipattern
                                Check for performance anti-patterns when using Grand Central Dispatch
  optin.performance.Padding     Check for excessively padded structs.
  optin.portability.UnixAPI     Finds implementation-defined behavior in UNIX/Posix functions
  osx.API                       Check for proper uses of various Apple APIs
  osx.MIG                       Find violations of the Mach Interface Generator calling convention
  osx.NumberObjectConversion    Check for erroneous conversions of objects representing numbers into numbers
  osx.OSObjectRetainCount       Check for leaks and improper reference count management for OSObject
  osx.ObjCProperty              Check for proper uses of Objective-C properties
  osx.SecKeychainAPI            Check for proper uses of Secure Keychain APIs
  osx.cocoa.AtSync              Check for nil pointers used as mutexes for @synchronized
  osx.cocoa.AutoreleaseWrite    Warn about potentially crashing writes to autoreleasing objects from different autoreleasing pools in Objective-C
  osx.cocoa.ClassRelease        Check for sending 'retain', 'release', or 'autorelease' directly to a Class
  osx.cocoa.Dealloc             Warn about Objective-C classes that lack a correct implementation of -dealloc
  osx.cocoa.IncompatibleMethodTypes
                                Warn about Objective-C method signatures with type incompatibilities
  osx.cocoa.Loops               Improved modeling of loops using Cocoa collection types
  osx.cocoa.MissingSuperCall    Warn about Objective-C methods that lack a necessary call to super
  osx.cocoa.NSAutoreleasePool   Warn for suboptimal uses of NSAutoreleasePool in Objective-C GC mode
  osx.cocoa.NSError             Check usage of NSError** parameters
  osx.cocoa.NilArg              Check for prohibited nil arguments to ObjC method calls
  osx.cocoa.NonNilReturnValue   Model the APIs that are guaranteed to return a non-nil value
  osx.cocoa.ObjCGenerics        Check for type errors when using Objective-C generics
  osx.cocoa.RetainCount         Check for leaks and improper reference count management
  osx.cocoa.RunLoopAutoreleaseLeak
                                Check for leaked memory in autorelease pools that will never be drained
  osx.cocoa.SelfInit            Check that 'self' is properly initialized inside an initializer method
  osx.cocoa.SuperDealloc        Warn about improper use of '[super dealloc]' in Objective-C
  osx.cocoa.UnusedIvars         Warn about private ivars that are never used
  osx.cocoa.VariadicMethodTypes Check for passing non-Objective-C types to variadic collection initialization methods that expect only Objective-C types
  osx.coreFoundation.CFError    Check usage of CFErrorRef* parameters
  osx.coreFoundation.CFNumber   Check for proper uses of CFNumber APIs
  osx.coreFoundation.CFRetainRelease
                                Check for null arguments to CFRetain/CFRelease/CFMakeCollectable
  osx.coreFoundation.containers.OutOfBounds
                                Checks for index out-of-bounds when using 'CFArray' API
  osx.coreFoundation.containers.PointerSizedValues
                                Warns if 'CFArray', 'CFDictionary', 'CFSet' are created with non-pointer-size values
  security.FloatLoopCounter     Warn on using a floating point value as a loop counter (CERT: FLP30-C, FLP30-CPP)
  security.insecureAPI.DeprecatedOrUnsafeBufferHandling
                                Warn on uses of unsecure or deprecated buffer manipulating functions
  security.insecureAPI.UncheckedReturn
                                Warn on uses of functions whose return values must be always checked
  security.insecureAPI.bcmp     Warn on uses of the 'bcmp' function
  security.insecureAPI.bcopy    Warn on uses of the 'bcopy' function
  security.insecureAPI.bzero    Warn on uses of the 'bzero' function
  security.insecureAPI.decodeValueOfObjCType
                                Warn on uses of the '-decodeValueOfObjCType:at:' method
  security.insecureAPI.getpw    Warn on uses of the 'getpw' function
  security.insecureAPI.gets     Warn on uses of the 'gets' function
  security.insecureAPI.mkstemp  Warn when 'mkstemp' is passed fewer than 6 X's in the format string
  security.insecureAPI.mktemp   Warn on uses of the 'mktemp' function
  security.insecureAPI.rand     Warn on uses of the 'rand', 'random', and related functions
  security.insecureAPI.strcpy   Warn on uses of the 'strcpy' and 'strcat' functions
  security.insecureAPI.vfork    Warn on uses of the 'vfork' function
  unix.API                      Check calls to various UNIX/Posix functions
  unix.Malloc                   Check for memory leaks, double free, and use-after-free problems. Traces memory managed by malloc()/free().
  unix.MallocSizeof             Check for dubious malloc arguments involving sizeof
  unix.MismatchedDeallocator    Check for mismatched deallocators.
  unix.Vfork                    Check for proper usage of vfork
  unix.cstring.BadSizeArg       Check the size argument passed into C string functions for common erroneous patterns
  unix.cstring.NullArg          Check for null pointers being passed as arguments to C string functions
  valist.CopyToSelf             Check for va_lists which are copied onto itself.
  valist.Uninitialized          Check for usages of uninitialized (or already released) va_lists.
  valist.Unterminated           Check for va_lists which are not released by a va_end call.

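With access to these stages of compilation we can do a lot, such as defining custom check rules, obfuscating code automatically, or even translating code into another language.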
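
A custom check rule can be pictured as a walk over a syntax tree that reports suspicious patterns. The sketch below flags division by a literal zero. It is purely syntactic and uses hypothetical types (`Expr`, `ExprKind`) and a hypothetical function name; the real analyzer is path-sensitive and far more capable, so treat this only as an illustration of the idea.

```c
#include <stddef.h>

typedef enum { EXPR_LITERAL, EXPR_BINARY } ExprKind;

typedef struct Expr {
    ExprKind kind;
    int value;                /* literal value when kind == EXPR_LITERAL */
    char op;                  /* operator when kind == EXPR_BINARY */
    const struct Expr *lhs;
    const struct Expr *rhs;
} Expr;

/* Count '/' and '%' operators whose right operand is the literal 0. */
int count_divide_by_literal_zero(const Expr *e) {
    if (e == NULL || e->kind == EXPR_LITERAL) {
        return 0;
    }
    /* Visit both subtrees first, then inspect this node. */
    int findings = count_divide_by_literal_zero(e->lhs)
                 + count_divide_by_literal_zero(e->rhs);
    if ((e->op == '/' || e->op == '%')
        && e->rhs != NULL
        && e->rhs->kind == EXPR_LITERAL
        && e->rhs->value == 0) {
        findings++;
    }
    return findings;
}
```

A real rule would also record the source location of each finding so that it can be reported as a warning.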

Generation

LLVM IR

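In the compilation process, the stage in which Clang parses the source and produces IR can be called the LLVM Frontend, and the stage that turns IR into target machine code the LLVM Backend. LLVM IR is the output of the Frontend and the input of the Backend; it is the bridge language between the two. A diagram borrowed from Dai Ming illustrates this:

The Clang front end parses, validates, and diagnoses errors in the input code, then translates the parsed code into LLVM IR. The LLVM back end runs the IR through a series of analysis and optimization passes that improve the code, then hands it to the code generator to produce native machine code.

Whether you are compiling Objective-C or Swift, and whatever the target hardware is, the one constant in LLVM is the intermediate language, LLVM IR, which is independent of the source language. If you want to create a new language, you only need to finish parsing and then emit IR through the interfaces LLVM provides, and the program can then run on each of the different platforms.

LLVM IR has three representations: the first uses the .bc suffix and is the Bitcode storage format; the second is the human-readable .ll format; the third is the in-memory format used to manipulate IR during development. Let's take a look. Enter the following on the command line: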

clang -S -emit-llvm main.m -o main.ll

A main.ll file appears in the current directory. Open it in Xcode:

; ModuleID = 'main.m'
source_filename = "main.m"
target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.15.0"

%0 = type opaque
%struct._class_t = type { %struct._class_t*, %struct._class_t*, %struct._objc_cache*, i8* (i8*, i8*)**, %struct._class_ro_t* }
%struct._objc_cache = type opaque
%struct._class_ro_t = type { i32, i32, i32, i8*, i8*, %struct.__method_list_t*, %struct._objc_protocol_list*, %struct._ivar_list_t*, i8*, %struct._prop_list_t* }
%struct.__method_list_t = type { i32, i32, [0 x %struct._objc_method] }
%struct._objc_method = type { i8*, i8*, i8* }
%struct._objc_protocol_list = type { i64, [0 x %struct._protocol_t*] }
%struct._protocol_t = type { i8*, i8*, %struct._objc_protocol_list*, %struct.__method_list_t*, %struct.__method_list_t*, %struct.__method_list_t*, %struct.__method_list_t*, %struct._prop_list_t*, i32, i32, i8**, i8*, %struct._prop_list_t* }
%struct._ivar_list_t = type { i32, i32, [0 x %struct._ivar_t] }
%struct._ivar_t = type { i64*, i8*, i8*, i32, i32 }
%struct._prop_list_t = type { i32, i32, [0 x %struct._prop_t] }
%struct._prop_t = type { i8*, i8* }
%struct.__NSConstantString_tag = type { i32*, i32, i8*, i64 }

@"OBJC_CLASS_$_NSString" = external global %struct._class_t
@"OBJC_CLASSLIST_REFERENCES_$_" = internal global %struct._class_t* @"OBJC_CLASS_$_NSString", section "__DATA,__objc_classrefs,regular,no_dead_strip", align 8
@.str = private unnamed_addr constant [6 x i8] c"clang\00", align 1
@OBJC_METH_VAR_NAME_ = private unnamed_addr constant [20 x i8] c"initWithUTF8String:\00", section "__TEXT,__objc_methname,cstring_literals", align 1
@OBJC_SELECTOR_REFERENCES_ = internal externally_initialized global i8* getelementptr inbounds ([20 x i8], [20 x i8]* @OBJC_METH_VAR_NAME_, i32 0, i32 0), section "__DATA,__objc_selrefs,literal_pointers,no_dead_strip", align 8
@__CFConstantStringClassReference = external global [0 x i32]
@.str.1 = private unnamed_addr constant [11 x i8] c"%@ rank %d\00", section "__TEXT,__cstring,cstring_literals", align 1
@_unnamed_cfstring_ = private global %struct.__NSConstantString_tag { i32* getelementptr inbounds ([0 x i32], [0 x i32]* @__CFConstantStringClassReference, i32 0, i32 0), i32 1992, i8* getelementptr inbounds ([11 x i8], [11 x i8]* @.str.1, i32 0, i32 0), i64 10 }, section "__DATA,__cfstring", align 8 #0
@llvm.compiler.used = appending global [3 x i8*] [i8* bitcast (%struct._class_t** @"OBJC_CLASSLIST_REFERENCES_$_" to i8*), i8* getelementptr inbounds ([20 x i8], [20 x i8]* @OBJC_METH_VAR_NAME_, i32 0, i32 0), i8* bitcast (i8** @OBJC_SELECTOR_REFERENCES_ to i8*)], section "llvm.metadata"

; Function Attrs: noinline optnone ssp uwtable
define i32 @main(i32, i8**) #1 {
  %3 = alloca i32, align 4
  %4 = alloca i32, align 4
  %5 = alloca i8**, align 8
  %6 = alloca i32, align 4
  %7 = alloca i32, align 4
  %8 = alloca %0*, align 8
  %9 = alloca i32, align 4
  store i32 0, i32* %3, align 4
  store i32 %0, i32* %4, align 4
  store i8** %1, i8*** %5, align 8
  %10 = call i8* @llvm.objc.autoreleasePoolPush() #2
  store i32 8, i32* %6, align 4
  store i32 6, i32* %7, align 4
  %11 = load %struct._class_t*, %struct._class_t** @"OBJC_CLASSLIST_REFERENCES_$_", align 8
  %12 = bitcast %struct._class_t* %11 to i8*
  %13 = call i8* @objc_alloc(i8* %12)
  %14 = bitcast i8* %13 to %0*
  %15 = load i8*, i8** @OBJC_SELECTOR_REFERENCES_, align 8, !invariant.load !9
  %16 = bitcast %0* %14 to i8*
  %17 = call i8* bitcast (i8* (i8*, i8*, ...)* @objc_msgSend to i8* (i8*, i8*, i8*)*)(i8* %16, i8* %15, i8* getelementptr inbounds ([6 x i8], [6 x i8]* @.str, i64 0, i64 0))
  %18 = bitcast i8* %17 to %0*
  store %0* %18, %0** %8, align 8
  %19 = load i32, i32* %6, align 4
  %20 = load i32, i32* %7, align 4
  %21 = add nsw i32 %19, %20
  store i32 %21, i32* %9, align 4
  %22 = load %0*, %0** %8, align 8
  %23 = load i32, i32* %9, align 4
  notail call void (i8*, ...) @NSLog(i8* bitcast (%struct.__NSConstantString_tag* @_unnamed_cfstring_ to i8*), %0* %22, i32 %23)
  call void @llvm.objc.autoreleasePoolPop(i8* %10)
  ret i32 0
}

; Function Attrs: nounwind
declare i8* @llvm.objc.autoreleasePoolPush() #2

declare i8* @objc_alloc(i8*)

; Function Attrs: nonlazybind
declare i8* @objc_msgSend(i8*, i8*, ...) #3

declare void @NSLog(i8*, ...) #4

; Function Attrs: nounwind
declare void @llvm.objc.autoreleasePoolPop(i8*) #2

attributes #0 = { "objc_arc_inert" }
attributes #1 = { noinline optnone ssp uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "darwin-stkchk-strong-link" "disable-tail-calls"="false" "frame-pointer"="all" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "probe-stack"="___chkstk_darwin" "stack-protector-buffer-size"="8" "target-cpu"="penryn" "target-features"="+cx16,+cx8,+fxsr,+mmx,+sahf,+sse,+sse2,+sse3,+sse4.1,+ssse3,+x87" "unsafe-fp-math"="false" "use-soft-float"="false" }
attributes #2 = { nounwind }
attributes #3 = { nonlazybind }
attributes #4 = { "correctly-rounded-divide-sqrt-fp-math"="false" "darwin-stkchk-strong-link" "disable-tail-calls"="false" "frame-pointer"="all" "less-precise-fpmad"="false" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "probe-stack"="___chkstk_darwin" "stack-protector-buffer-size"="8" "target-cpu"="penryn" "target-features"="+cx16,+cx8,+fxsr,+mmx,+sahf,+sse,+sse2,+sse3,+sse4.1,+ssse3,+x87" "unsafe-fp-math"="false" "use-soft-float"="false" }

!llvm.module.flags = !{!0, !1, !2, !3, !4, !5, !6, !7}
!llvm.ident = !{!8}

!0 = !{i32 2, !"SDK Version", [3 x i32] [i32 10, i32 15, i32 4]}
!1 = !{i32 1, !"Objective-C Version", i32 2}
!2 = !{i32 1, !"Objective-C Image Info Version", i32 0}
!3 = !{i32 1, !"Objective-C Image Info Section", !"__DATA,__objc_imageinfo,regular,no_dead_strip"}
!4 = !{i32 4, !"Objective-C Garbage Collection", i32 0}
!5 = !{i32 1, !"Objective-C Class Properties", i32 64}
!6 = !{i32 1, !"wchar_size", i32 4}
!7 = !{i32 7, !"PIC Level", i32 2}
!8 = !{!"Apple clang version 11.0.3 (clang-1103.0.32.29)"}
!9 = !{}

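A quick guide to the LLVM IR notation:

  • %: local variable
  • @: global variable
  • alloca: allocates memory for the currently executing function; the memory is released automatically when the function finishes
  • i32: an integer type; i32 stands for 32 bits
  • align: alignment; align 4 means the data is aligned to a 4-byte boundary, so its address is a multiple of 4
  • load: reads a value
  • store: writes a value
  • icmp: compares two integer values and returns a boolean
  • br: branch; jumps to the corresponding label depending on a condition
  • label: a code label
  • call: calls a function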
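
To connect these instructions back to source code, here is a small C sketch whose comments name the kind of IR instruction that unoptimized output uses for each step. The first function mirrors the `i`, `j`, and `rank` lines of main.m and the corresponding alloca, store, load, and add lines in the main.ll listing; the second function, `max_of`, is a hypothetical addition, and its comments describe what I would expect Clang to emit without optimization rather than output quoted from the listing.

```c
/* Mirrors the arithmetic in main.m. */
int rank_example(void) {
    int i = 8;          /* alloca i32, then store i32 8 */
    int j = 6;          /* alloca i32, then store i32 6 */
    int rank = i + j;   /* two loads, add nsw i32, then a store */
    return rank;        /* load the value and return it with ret */
}

/* Hypothetical example of a branch. */
int max_of(int a, int b) {
    if (a > b) {        /* icmp produces a boolean, br jumps to a label */
        return a;
    }
    return b;
}
```

Compiling a file like this with `clang -S -emit-llvm` and reading the result next to the source is a practical way to learn the notation.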

LLVM Backend

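The complete LLVM Backend stage is also known as CodeGen. The whole process can be divided into the following phases:

  • Instruction Selection: converts the IR into a DAG made up of target-platform instructions, choosing instructions that perform the required operation with the shortest execution time;
  • Scheduling and Formation: reads the DAG and arranges its instructions into a queue of MachineInstrs, reordering them according to the dependencies between instructions so that they make better use of the CPU's functional units;
  • SSA-based optimization: made up of multiple SSA-based passes;
  • Register Allocation: maps virtual registers to physical registers or memory addresses;
  • Prolog/Epilog generation: once the instructions of the function body have been generated, the stack size the function needs can be determined;
  • Late Machine Code optimization: optimizations on the machine code; this is the last chance to optimize;
  • Code Emission: outputs the code, either as assembly or as binary machine code.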
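
The register allocation phase can be illustrated with a deliberately simplified linear-scan-style allocator. This is a sketch under my own assumptions, not LLVM's algorithm, whose real allocators are considerably more sophisticated: `LiveInterval`, `NUM_PHYS_REGS`, `SPILLED`, and `allocate_registers` are hypothetical names, and when no register is free this version simply spills the incoming value to memory.

```c
#include <stddef.h>

#define NUM_PHYS_REGS 2    /* pretend the target has only two registers */
#define SPILLED (-1)       /* marker: the value lives in memory instead */

typedef struct {
    int start;     /* first instruction index where the value is live (>= 0) */
    int end;       /* last instruction index where the value is live */
    int phys_reg;  /* output: physical register index, or SPILLED */
} LiveInterval;

/* Assign registers to intervals, which must be sorted by ascending start. */
void allocate_registers(LiveInterval *intervals, size_t count) {
    int busy_until[NUM_PHYS_REGS];
    for (int r = 0; r < NUM_PHYS_REGS; r++) {
        busy_until[r] = -1;                 /* every register starts free */
    }
    for (size_t i = 0; i < count; i++) {
        intervals[i].phys_reg = SPILLED;    /* assume the worst */
        for (int r = 0; r < NUM_PHYS_REGS; r++) {
            /* A register is free once its previous value is dead. */
            if (busy_until[r] < intervals[i].start) {
                busy_until[r] = intervals[i].end;
                intervals[i].phys_reg = r;
                break;
            }
        }
    }
}
```

Each virtual register becomes one `LiveInterval`; values whose lifetimes do not overlap can share a physical register, and anything that does not fit is kept in memory.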

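Files Produced by Compilation

Compilation produces a number of files. We'll mainly cover three of them:

  • Link Map File, which describes the contents of the binary
  • dSYM file
  • Mach-O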

Link Map File

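A Link Map File has three parts:

  • Object files: the .o files compiled from .m files and the .a files that need to be linked, including each file's number and path;
  • Sections: describes the position and size of each section in the executable;
  • Symbols: a finer breakdown of the sections, describing all methods, ivars, and strings together with their address, size, and file number.

The path can be found in Build Settings, as shown in the figure below:

If you want to use binary reordering to speed up app launch, you will need the Link Map File. For a related article, see: iOS 优化篇 - 启动优化之Clang插桩实现二进制重排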
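
Since each symbol entry carries an address, a size, and a file number, a line of the symbol part can be parsed with a few lines of C. The sketch below assumes that a symbol line looks like `0x100001F40  0x00000030  [  2] -[ViewController viewDidLoad]`; that layout is my assumption about the file, and `LinkMapSymbol` and `parse_symbol_line` are hypothetical names.

```c
#include <stdio.h>

typedef struct {
    unsigned long long address;  /* where the symbol starts */
    unsigned long long size;     /* how many bytes it occupies */
    unsigned file_index;         /* number of the object file it came from */
    char name[256];              /* symbol name, may contain spaces */
} LinkMapSymbol;

/* Parse one symbol line in the assumed "address size [file] name" layout.
 * Returns 1 on success and 0 if the line does not match. */
int parse_symbol_line(const char *line, LinkMapSymbol *out) {
    int fields = sscanf(line, "%llx %llx [%u] %255[^\n]",
                        &out->address, &out->size,
                        &out->file_index, out->name);
    return fields == 4;
}
```

Summing `size` per `file_index` gives a rough picture of how much each object file contributes to the binary.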

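dSYM Files

A dSYM file stores the mapping from addresses to functions, so the addresses in a call stack can be looked up in the dSYM to recover the corresponding function information. It is typically used to symbolicate crashes, converting the call stack saved at crash time into the corresponding functions. This is why services such as Umeng and Bugly ask you to upload your dSYM files.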
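
The address lookup at the heart of symbolication can be sketched as a binary search over a sorted table. This is only an illustration of the idea: `SymbolEntry` and `symbolicate` are hypothetical names, the table is assumed to be sorted by start address with no overlapping entries, and the actual storage format inside a dSYM is much richer than this.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t start;     /* address where the function begins */
    uint64_t size;      /* length of the function in bytes */
    const char *name;   /* human-readable function name */
} SymbolEntry;

/* Return the name of the function containing `address`, or NULL.
 * `table` must be sorted by ascending start address. */
const char *symbolicate(const SymbolEntry *table, size_t count,
                        uint64_t address) {
    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (address < table[mid].start) {
            hi = mid;                        /* look in the lower half */
        } else if (address >= table[mid].start + table[mid].size) {
            lo = mid + 1;                    /* look in the upper half */
        } else {
            return table[mid].name;          /* address is inside this entry */
        }
    }
    return NULL;                             /* no function covers it */
}
```

A crash reporter applies a lookup like this to every frame address in the saved call stack, which is why the matching dSYM for the exact build is required.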

Mach-O

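Mach-O is the file format used for compiled executables, object code, shared libraries, dynamically loaded code, and core dumps. The code signature exists to ensure that the signed files cannot be modified directly: the _CodeSignature directory in the app bundle holds the signature data for the bundle's files, while the signature of the Mach-O binary itself is embedded in the binary. If you have ever re-signed an app with an enterprise certificate, though, you will know that some things can still be changed, as long as the signature is regenerated afterwards.

A Mach-O file has three main parts:

  • Mach-O Header: contains the byte order, magic number, CPU type, number of load commands, and so on;
  • Load Commands: include the locations of the regions, the symbol table, the dynamic symbol table, and so on. Each load command carries metadata such as the command type, a name, and its position in the binary;
  • Data: the largest part, containing code and data, such as the symbol table and the dynamic symbol table.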
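
The header fields listed above can be read with a few lines of C. To keep the sketch portable, it declares its own struct instead of including Apple's headers: `MachHeader64` is a hypothetical name whose layout follows the 64-bit Mach-O header, `read_macho_header` is an illustrative helper, and only native-endian 64-bit files are handled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MACHO_MAGIC_64 0xfeedfacfu  /* magic number of a 64-bit Mach-O file */

/* Local mirror of the 64-bit Mach-O header (eight 4-byte fields). */
typedef struct {
    uint32_t magic;       /* identifies the file type and byte order */
    int32_t  cputype;     /* CPU architecture */
    int32_t  cpusubtype;  /* CPU variant */
    uint32_t filetype;    /* executable, object file, dylib, ... */
    uint32_t ncmds;       /* number of load commands */
    uint32_t sizeofcmds;  /* total size of the load commands in bytes */
    uint32_t flags;
    uint32_t reserved;
} MachHeader64;

/* Copy the header out of `buf`. Returns 1 if the buffer is large enough and
 * starts with the 64-bit magic number in native byte order, else 0. */
int read_macho_header(const unsigned char *buf, size_t len, MachHeader64 *out) {
    if (len < sizeof *out) {
        return 0;
    }
    memcpy(out, buf, sizeof *out);  /* avoids unaligned pointer casts */
    return out->magic == MACHO_MAGIC_64;
}
```

After a successful call, `ncmds` tells a reader how many load commands follow the header, which is the starting point for walking the rest of the file.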

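Summary

To summarize:

  1. iOS processes code with a compiler rather than an interpreter in order to get faster execution;
  2. The iOS compilation process uses the LLVM toolchain to build a syntax tree (AST), converts the AST into IR, and finally generates machine code for the target platform from the IR;
  3. We also introduced several files produced by a successful build: the Link Map File, dSYM, and Mach-O.

Knowledge is learned in order to be applied. The next few articles will look at how to use compilation knowledge to solve real problems, so stay tuned.

Recommended Reading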

iOS 开发高手课

跟戴铭学iOS编程:理顺核心知识点