Linux Kernel Source Study: PER_CPU Variables, swapgs, and Stack Switching (Part 2)

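Notes:

This article is based on Linux kernel v3.10 on the x86_64 architecture.

This article does not cover the details of debugging, tracing, or exception handling.

In Linux Kernel Source Study: PER_CPU Variables, swapgs, and Stack Switching (Part 1), we walked through how per-cpu variables are initialized. In this part we look at how the GS register is initialized.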

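1. Segment Registers

On x86_64, each processor has six segment registers, and each segment register has a visible part and a hidden part. The hidden part is also called the "descriptor cache" or "shadow register". The visible part is a 16-bit register that holds the segment selector; the hidden part holds three things: the segment base address, the segment limit, and the access-control information.

When a segment selector is loaded into the visible part of a segment register, the processor also loads the base address, limit, and access-control information from the segment descriptor that the selector points to into the hidden part of that register.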

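2. Segment-Register Load Instructions

2.1 Categories of Load Instructions

There are two classes of instructions that load segment registers:

Direct load instructions

These instructions name the segment register explicitly. They include MOV, POP, LDS, LES, LSS, LGS, and LFS.

Implied load instructions

These instructions implicitly change the contents of the CS register. They include the far forms of the jump and call instructions CALL, JMP, and RET, the system-call instructions SYSENTER and SYSEXIT, and the interrupt-related instructions IRET, INT n, INTO, INT3, and INT1.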

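2.2 Segment Load Instructions in 64-Bit Mode

In 64-bit mode, the ES, DS, and SS segment registers are not used, so their fields in the segment-descriptor registers (base, limit, and attribute) are ignored. Whenever an address is computed through ES, DS, or SS, the segment base is treated as 0.

2.2.1 MOV and Its Limitation

In 64-bit mode, when an FS or GS segment-override prefix is used, the corresponding base address does take part in the linear-address calculation. However, ordinary segment-load instructions (such as MOV to Sreg or POP Sreg) load only a standard 32-bit base into the hidden part of FS or GS, and the bits above bit 31 are cleared to 0. The Intel SDM explains this as keeping behavior consistent for implementations that use fewer than 64 bits. In other words, MOV cannot load a full 64-bit base address into FS or GS.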

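2.2.2 WRMSR

The WRMSR instruction writes the contents of EDX:EAX into a 64-bit MSR (model-specific register). Which MSR is written is selected by the value in ECX. The contents of EDX are copied to the high 32 bits of the MSR, and the contents of EAX are copied to the low 32 bits.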

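To allow a full 64-bit base address to be loaded into the FS and GS shadow registers in 64-bit mode, the FS.base and GS.base fields of the shadow registers are physically mapped to two MSRs: IA32_FS_BASE and IA32_GS_BASE. Both MSRs are 64 bits wide, so writing an address to them with WRMSR is equivalent to writing directly to FS.base and GS.base in the shadow registers.

2.2.3 SWAPGS

64-bit mode adds a new instruction for loading the GS base: SWAPGS. SWAPGS exchanges the value in the IA32_KERNEL_GS_BASE MSR (which holds a pointer to kernel data structures) with the value in the GS base register, GS.base (that is, IA32_GS_BASE). After the exchange, the kernel can use the GS prefix to access its own data structures.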

2.2.4 MSRs

Register Address	Architectural MSR Name / Bit Fields	MSR/Bit Description	Comment
C000_0100H	IA32_FS_BASE	Map of BASE Address of FS (R/W)	If CPUID.80000001:EDX.[29] = 1
C000_0101H	IA32_GS_BASE	Map of BASE Address of GS (R/W)	If CPUID.80000001:EDX.[29] = 1
C000_0102H	IA32_KERNEL_GS_BASE	Swap Target of BASE Address of GS (R/W)	If CPUID.80000001:EDX.[29] = 1

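For more on segment registers, see the Intel 64 and IA-32 Architectures Software Developer Manuals (hereafter "Intel SDM"), Vol. 3A, Sections 3.4.3 and 3.4.4.

3. GS Register Initialization

3.1 BSP and AP

Before describing how the GS register is initialized, we need one concept: BSP and AP.

Most modern servers are MP (multiple-processor) systems. According to Intel's specification, an MP system must follow the MP initialization protocol when it initializes. The protocol divides processors into two classes: the BSP (bootstrap processor) and the APs (application processors). When an MP system is powered up or reset, the system hardware dynamically selects one processor as the BSP; all remaining processors are treated as APs.

The BSP executes the BIOS boot-strap code to configure the APIC (Advanced Programmable Interrupt Controller) environment, set up system-wide data structures, and start and initialize the APs. Only after the BSP and the APs have all been initialized does the BSP begin executing the operating system's initialization code.

For the details of MP initialization, see Intel SDM Vol. 3A, Section 9.4, MULTIPLE-PROCESSOR (MP) INITIALIZATION. Part of it is quoted below.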

9.4.1 BSP and AP Processors

The MP initialization protocol defines two classes of processors: the bootstrap processor (BSP) and the application processors (APs). Following a power-up or RESET of an MP system, system hardware dynamically selects one of the processors on the system bus as the BSP. The remaining processors are designated as APs.

As part of the BSP selection mechanism, the BSP flag is set in the IA32_APIC_BASE MSR (see Figure 11-5) of the BSP, indicating that it is the BSP. This flag is cleared for all other processors.

The BSP executes the BIOS’s boot-strap code to configure the APIC environment, sets up system-wide data structures, and starts and initializes the APs. When the BSP and APs are initialized, the BSP then begins executing the operating-system initialization code.

Following a power-up or reset, the APs complete a minimal self-configuration, then wait for a startup signal (a SIPI message) from the BSP processor. Upon receiving a SIPI message, an AP executes the BIOS AP configuration code, which ends with the AP being placed in halt state.

For Intel 64 and IA-32 processors supporting Intel Hyper-Threading Technology, the MP initialization protocol treats each of the logical processors on the system bus or coherent link domain as a separate processor (with a unique APIC ID). During boot-up, one of the

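With the concepts in place, let's look at initialization on the BSP.

3.2 GS Register Initialization on the BSP

In the previous article we covered the setup_per_cpu_areas function. The GS register of the BSP is initialized in that same function.

In setup_per_cpu_areas, while for_each_possible_cpu(cpu) iterates over the cpus, the loop not only copies the per-cpu data for each cpu but also checks whether the cpu is the BSP (the BSP has cpu number 0). If it is, switch_to_new_gdt(cpu) is called.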

// file: arch/x86/kernel/setup_percpu.c
void __init setup_per_cpu_areas(void)
{
	......
	
	for_each_possible_cpu(cpu) {
	
		......
		
		/*
		 * Up to this point, the boot CPU has been using .init.data
		 * area.  Reload any changed state for the boot CPU.
		 */
		if (!cpu)
			switch_to_new_gdt(cpu);
	}
	
	......
	
}

switch_to_new_gdt(cpu) is implemented as follows:

// file: arch/x86/kernel/cpu/common.c
/*
 * Current gdt points %fs at the "master" per-cpu area: after this,
 * it's on the real one.
 */
void switch_to_new_gdt(int cpu)
{
	
    ......
        
	/* Reload the per-cpu base */

	load_percpu_segment(cpu);
}

3.2.1 load_percpu_segment

As the comment says, load_percpu_segment(cpu) reloads the per-cpu base address. load_percpu_segment is implemented as follows:

// file: arch/x86/kernel/cpu/common.c
void load_percpu_segment(int cpu)
{
#ifdef CONFIG_X86_32
	loadsegment(fs, __KERNEL_PERCPU);
#else
	loadsegment(gs, 0);
	wrmsrl(MSR_GS_BASE, (unsigned long)per_cpu(irq_stack_union.gs_base, cpu));
#endif
	load_stack_canary_segment();
}

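Since we are on x86_64, the kernel config option CONFIG_X86_32 is not set, so the #else branch is compiled. That branch first sets the gs register to 0 with loadsegment(gs, 0), and then uses the wrmsrl macro to write the address of the per-cpu variable irq_stack_union.gs_base into the MSR_GS_BASE register. That address is the base address of the cpu's per-cpu area.

3.2.2 irq_stack_union

Why is irq_stack_union.gs_base the base address of the per-cpu area? Let's start with how the variable is created: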

// file: arch/x86/kernel/cpu/common.c
DEFINE_PER_CPU_FIRST(union irq_stack_union,
		     irq_stack_union) __aligned(PAGE_SIZE);

// file: include/linux/percpu-defs.h
#define DEFINE_PER_CPU_FIRST(type, name)				\
	DEFINE_PER_CPU_SECTION(type, name, PER_CPU_FIRST_SECTION)

#define PER_CPU_FIRST_SECTION "..first"

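As shown, the DEFINE_PER_CPU_FIRST macro expands to DEFINE_PER_CPU_SECTION and passes the section suffix "..first". DEFINE_PER_CPU_SECTION was analyzed in detail in Linux Kernel Source Study: PER_CPU Variables, swapgs, and Stack Switching (Part 1), so we won't repeat it here. The net result is that irq_stack_union is compiled into section(.data..percpu..first).

At link time, the data in section(.data..percpu..first) and in the other per-cpu sections is emitted into the single output section(.data..percpu), with section(.data..percpu..first) placed at the lowest address. In other words, after linking, the start address of section(.data..percpu..first) is the start address of section(.data..percpu), which is also the start address of the whole per-cpu area.

section(.data..percpu..first) contains exactly one variable, irq_stack_union, defined as follows: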

// file: arch/x86/include/asm/processor.h
union irq_stack_union {
	char irq_stack[IRQ_STACK_SIZE];
	/*
	 * GCC hardcodes the stack canary as %gs:40.  Since the
	 * irq_stack is the object at %gs:0, we reserve the bottom
	 * 48 bytes of the irq stack for the canary.
	 */
	struct {
		char gs_base[40];
		unsigned long stack_canary;
	};
};

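irq_stack_union is a union and irq_stack_union.gs_base[] is an array. The expression irq_stack_union.gs_base evaluates to the address of that array, which is numerically equal to the address of irq_stack_union itself, so it is in fact the start address of section(.data..percpu).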

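per_cpu(irq_stack_union.gs_base, cpu) yields the address of irq_stack_union.gs_base in the given cpu's copy of the per-cpu area, which is the value stored in __per_cpu_offset[cpu]. The full derivation is in the previous article, Linux Kernel Source Study: PER_CPU Variables, swapgs, and Stack Switching (Part 1).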

3.2.3 Implementation

loadsegment is a macro, implemented as follows:

// file: arch/x86/include/asm/segment.h
/*
 * Load a segment. Fall back on loading the zero
 * segment if something goes wrong..
 */
#define loadsegment(seg, value)						\
do {									\
	unsigned short __val = (value);					\
									\
	asm volatile("						\n"	\
		     "1:	movl %k0,%%" #seg "		\n"	\
									\
		     ".section .fixup,\"ax\"			\n"	\
		     "2:	xorl %k0,%k0			\n"	\
		     "		jmp 1b				\n"	\
		     ".previous					\n"	\
									\
		     _ASM_EXTABLE(1b, 2b)				\
									\
		     : "+r" (__val) : : "memory");			\
} while (0)

This macro in turn uses the _ASM_EXTABLE macro, defined as follows:

// file: arch/x86/include/asm/asm.h
# define _ASM_EXTABLE(from,to)					\
	" .pushsection \"__ex_table\",\"a\"\n"			\
	" .balign 8\n"						\
	" .long (" #from ") - .\n"				\
	" .long (" #to ") - .\n"				\
	" .popsection\n"

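__ex_table is a special section that stores pairs of addresses: from is the address of ordinary code, and to is the address of the code that handles an exception raised there. If an exception occurs while the kernel is executing the instruction at from, execution continues at to. The __ex_table section works together with the .fixup section, which holds the exception-handling code itself. For more on __ex_table, see the kernel document exception-tables. This article does not go into exception handling; with the exception-handling code removed, loadsegment(seg, value) reduces to: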

do {									\
	unsigned short __val = (value);					\
									\
	asm volatile("						\n"	\
		     "1:	movl %k0,%%" #seg "		\n"	\
                 
			 ......						

		     : "+r" (__val) : : "memory");			\
} while (0)

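So on the normal path, the argument value is converted to the unsigned short variable __val and only the instruction at label 1 executes, moving __val into the named segment register. In %k0, k is an operand modifier that selects the 32-bit (SImode) name of the register; for example, if the operand is allocated to the a register family, %eax is what gets printed. The GCC documentation describes the k modifier as follows: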

Modifier	Description	Operand	‘att’
k	Print the SImode name of the register.	%k0	%eax

The full list of modifiers is in the GCC documentation, Extended Asm: x86 operand modifiers.

From this analysis of loadsegment(seg, value), we can see that loadsegment(gs, 0) sets the gs register to 0.

Next, we look at how wrmsrl executes.

wrmsrl(MSR_GS_BASE, (unsigned long)per_cpu(irq_stack_union.gs_base, cpu));

wrmsrl is a macro, defined as follows:

// file: arch/x86/include/asm/paravirt.h
#define wrmsrl(msr, val)	wrmsr(msr, (u32)((u64)(val)), ((u64)(val))>>32)

The wrmsrl macro takes two arguments: the number of the MSR to write and the value to write. The MSR_GS_BASE macro is the number of the IA32_GS_BASE MSR:

// file: arch/x86/include/uapi/asm/msr-index.h
#define MSR_GS_BASE		0xc0000101 /* 64bit GS base */

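As described in Section 2.2.2, the wrmsr instruction loads the contents of EDX:EAX into the selected MSR, and both registers are 32 bits wide. That is why wrmsrl splits val into a low and a high 32-bit half when it calls the wrmsr macro: (u32)((u64)(val)) is the low 32 bits and ((u64)(val))>>32 is the high 32 bits.

The wrmsr macro is defined below; it calls the inline function paravirt_write_msr in the same file: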

// file: arch/x86/include/asm/paravirt.h
#define wrmsr(msr, val1, val2)			\
do {						\
	paravirt_write_msr(msr, val1, val2);	\
} while (0)

static inline int paravirt_write_msr(unsigned msr, unsigned low, unsigned high)
{
	return PVOP_CALL3(int, pv_cpu_ops.write_msr, msr, low, high);
}

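PVOP_CALL3 belongs to the paravirtualization machinery, which is outside the scope of this article. It wraps a call to pv_cpu_ops.write_msr, so we go straight to that function. pv_cpu_ops is an instance of the structure of the same name; the structure type is declared in arch/x86/include/asm/paravirt_types.h and the instance is defined in arch/x86/kernel/paravirt.c: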

// file: arch/x86/kernel/paravirt.c
struct pv_cpu_ops pv_cpu_ops = {
    ...
        
    .write_msr = native_write_msr_safe,
    
    ...
};

native_write_msr_safe is defined as follows:

// file: arch/x86/include/asm/msr.h
/* Can be uninlined because referenced by paravirt */
notrace static inline int native_write_msr_safe(unsigned int msr,
					unsigned low, unsigned high)
{
	int err;
	asm volatile("2: wrmsr ; xor %[err],%[err]\n"
		     "1:\n\t"
		     ".section .fixup,\"ax\"\n\t"
		     "3:  mov %[fault],%[err] ; jmp 1b\n\t"
		     ".previous\n\t"
		     _ASM_EXTABLE(2b, 3b)
		     : [err] "=a" (err)
		     : "c" (msr), "0" (low), "d" (high),
		       [fault] "i" (-EIO)
		     : "memory");
	return err;
}

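This is inline assembly, and it again includes an exception-handling path. Exception handling and exception tables were introduced briefly above and are not our focus. With the exception-handling part removed, the code that actually executes is: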

notrace static inline int native_write_msr_safe(unsigned int msr,
					unsigned low, unsigned high)
{
	int err;
	asm volatile("2: wrmsr ; xor %[err],%[err]\n"

		     ...

		     : [err] "=a" (err)
		     : "c" (msr), "0" (low), "d" (high),
		       [fault] "i" (-EIO)
		     : "memory");
	return err;
}

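In the output operand list, the symbolic name [err] refers to the output operand err. The constraint "a" selects the a register family (%rax/%eax/%ax/%al); since err is an int, %eax is used here.

In the input operand list, the constraint "c" selects the c register family (%rcx/%ecx/%cx/%cl), and given the argument type, %ecx is used here. The constraint "0" means the input operand shares a register with operand 0 of the output list, which is %eax. The constraint "d" selects the d register family (%rdx/%edx/%dx/%dl), here %edx.

Only two instructions end up executing: wrmsr and xor. wrmsr writes the value in %edx:%eax to the MSR selected by %ecx; xor then clears %eax, which holds err, to 0. Finally err is returned.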

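To summarize, load_percpu_segment writes the per-cpu base address irq_stack_union.gs_base, that is __per_cpu_offset[cpu], into the MSR_GS_BASE register of the corresponding cpu. The BSP has cpu number 0, so __per_cpu_offset[0] is written into cpu0's MSR_GS_BASE.

Note: for background on inline assembly, see Linux Kernel Source Study Prerequisites: GCC Inline Assembly.

3.3 GS Register Initialization on the APs

After the bootstrap processor (BSP) finishes its self-test and part of system initialization and enters kernel code, SMP initialization is reached through start_kernel() -> rest_init() -> kernel_init() -> kernel_init_freeable() -> smp_init().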

3.3.1 smp_init

// file: kernel/smp.c
/* Called by boot processor to activate the rest. */
void __init smp_init(void)
{
	unsigned int cpu;

	idle_threads_init();

	/* FIXME: This should be done in userspace --RR */
	for_each_present_cpu(cpu) {
		if (num_online_cpus() >= setup_max_cpus)
			break;
		if (!cpu_online(cpu))
			cpu_up(cpu);
	}

	/* Any cleanup work */
	printk(KERN_INFO "Brought up %ld CPUs\n", (long)num_online_cpus());
	smp_cpus_done(setup_max_cpus);
}

smp_init() brings up at most setup_max_cpus cpus (the APs); the actual work of starting a cpu is done in cpu_up(cpu). setup_max_cpus is controlled by the maxcpus and nosmp command-line options: specifying nosmp or maxcpus=0 disables SMP. If neither is given, it defaults to NR_CPUS.

Handling of the maxcpus command-line option

// file: kernel/smp.c
/*
 * Setup routine for controlling SMP activation
 *
 * Command-line option of "nosmp" or "maxcpus=0" will disable SMP
 * activation entirely (the MPS table probe still happens, though).
 *
 * Command-line option of "maxcpus=<NUM>", where <NUM> is an integer
 * greater than 0, limits the maximum number of CPUs activated in
 * SMP mode to <NUM>.
 */

void __weak arch_disable_smp_support(void) { }

static int __init nosmp(char *str)
{
	setup_max_cpus = 0;
	arch_disable_smp_support();

	return 0;
}

static int __init maxcpus(char *str)
{
	get_option(&str, &setup_max_cpus);
	if (setup_max_cpus == 0)
		arch_disable_smp_support();

	return 0;
}

early_param("maxcpus", maxcpus);

Default value of setup_max_cpus

// file: 
/* Setup configured maximum number of CPUs to activate */
unsigned int setup_max_cpus = NR_CPUS;

3.3.2 cpu_up

// file: kernel/cpu.c
int __cpuinit cpu_up(unsigned int cpu)
{
	...
	
	err = _cpu_up(cpu, 0);

	...
        
	return err;
}

cpu_up calls _cpu_up to start the cpu.

3.3.3 _cpu_up

// file: kernel/cpu.c
/* Requires cpu_add_remove_lock to be held */
static int __cpuinit _cpu_up(unsigned int cpu, int tasks_frozen)
{
	...

	/* Arch-specific enabling code. */
	ret = __cpu_up(cpu, idle);
	
    ...

	return ret;
}

_cpu_up in turn calls __cpu_up.

3.3.4 __cpu_up

// file: arch/x86/include/asm/smp.h
static inline int __cpu_up(unsigned int cpu, struct task_struct *tidle)
{
	return smp_ops.cpu_up(cpu, tidle);
}

__cpu_up passes its arguments straight through to smp_ops.cpu_up.

3.3.5 smp_ops.cpu_up

smp_ops is a structure that holds the functions for SMP operations; smp_ops.cpu_up is implemented by native_cpu_up.

// file: arch/x86/kernel/smp.c
struct smp_ops smp_ops = {
	.smp_prepare_boot_cpu	= native_smp_prepare_boot_cpu,
	.smp_prepare_cpus	= native_smp_prepare_cpus,
	.smp_cpus_done		= native_smp_cpus_done,

	.stop_other_cpus	= native_stop_other_cpus,
	.smp_send_reschedule	= native_smp_send_reschedule,

	.cpu_up			= native_cpu_up,
	.cpu_die		= native_cpu_die,
	.cpu_disable		= native_cpu_disable,
	.play_dead		= native_play_dead,

	.send_call_func_ipi	= native_send_call_func_ipi,
	.send_call_func_single_ipi = native_send_call_func_single_ipi,
};

3.3.6 native_cpu_up

// file: arch/x86/kernel/smpboot.c
int __cpuinit native_cpu_up(unsigned int cpu, struct task_struct *tidle)
{
	int apicid = apic->cpu_present_to_apicid(cpu);
    
	...

	err = do_boot_cpu(apicid, cpu, tidle);
	if (err) {
		pr_debug("do_boot_cpu failed %d\n", err);
		return -EIO;
	}

	...

	return 0;
}

native_cpu_up obtains the apicid of the cpu and then calls do_boot_cpu to start it.

3.3.7 do_boot_cpu

// file: arch/x86/kernel/smpboot.c
/*
 * NOTE - on most systems this is a PHYSICAL apic ID, but on multiquad
 * (ie clustered apic addressing mode), this is a LOGICAL apic ID.
 * Returns zero if CPU booted OK, else error code from
 * ->wakeup_secondary_cpu.
 */
static int __cpuinit do_boot_cpu(int apicid, int cpu, struct task_struct *idle)
{
	volatile u32 *trampoline_status =
		(volatile u32 *) __va(real_mode_header->trampoline_status);
	/* start_ip had better be page-aligned! */
	unsigned long start_ip = real_mode_header->trampoline_start;

	unsigned long boot_error = 0;
	int timeout;
	int cpu0_nmi_registered = 0;
    
    ......

    initial_gs = per_cpu_offset(cpu);
    
	......
        
	initial_code = (unsigned long)start_secondary;

	......

	/*
	 * Wake up a CPU in difference cases:
	 * - Use the method in the APIC driver if it's defined
	 * Otherwise,
	 * - Use an INIT boot APIC message for APs or NMI for BSP.
	 */
	if (apic->wakeup_secondary_cpu)
		boot_error = apic->wakeup_secondary_cpu(apicid, start_ip);
	else
		boot_error = wakeup_cpu_via_init_nmi(cpu, start_ip, apicid,
						     &cpu0_nmi_registered);

	......

	return boot_error;
}

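do_boot_cpu first initializes start_ip, the entry point the AP executes after it is woken. It then assigns initial_gs and initial_code, and finally calls the wake-up function appropriate for the apic driver. If the apic driver provides a wakeup_secondary_cpu function, that function is used; otherwise wakeup_cpu_via_init_nmi is used.

On x86_64 the local apic is integrated into the cpu, and this article describes SMP initialization in terms of Intel's built-in local apic. Whichever apic driver is in use, start_ip is the entry point the AP executes after it is woken.

Intel's native local apic driver (see arch/x86/kernel/apic/apic_flat_64.c) does not provide a wakeup_secondary_cpu function, so the else branch is taken and wakeup_cpu_via_init_nmi is called.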

3.3.8 wakeup_cpu_via_init_nmi

// file: arch/x86/kernel/smpboot.c
/*
 * Wake up AP by INIT, INIT, STARTUP sequence.
 *
 * Instead of waiting for STARTUP after INITs, BSP will execute the BIOS
 * boot-strap code which is not a desired behavior for waking up BSP. To
 * void the boot-strap code, wake up CPU0 by NMI instead.
 *
 * This works to wake up soft offlined CPU0 only. If CPU0 is hard offlined
 * (i.e. physically hot removed and then hot added), NMI won't wake it up.
 * We'll change this code in the future to wake up hard offlined CPU0 if
 * real platform and request are available.
 */
static int __cpuinit
wakeup_cpu_via_init_nmi(int cpu, unsigned long start_ip, int apicid,
	       int *cpu0_nmi_registered)
{
	int id;
	int boot_error;

	/*
	 * Wake up AP by INIT, INIT, STARTUP sequence.
	 */
	if (cpu)
		return wakeup_secondary_cpu_via_init(apicid, start_ip);

	......

	return boot_error;
}

For every cpu other than cpu 0, wakeup_cpu_via_init_nmi calls wakeup_secondary_cpu_via_init to wake the AP.

3.3.9 wakeup_secondary_cpu_via_init

// file: arch/x86/kernel/smpboot.c
static int __cpuinit
wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
{
	unsigned long send_status, accept_status = 0;
	int maxlvt, num_starts, j;

	......

	/*
	 * Turn INIT on target chip
	 */
	/*
	 * Send IPI
	 */
	apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT,
		       phys_apicid);

	......

	mdelay(10);

	......

	/* Target chip */
	/* Send IPI */
	apic_icr_write(APIC_INT_LEVELTRIG | APIC_DM_INIT, phys_apicid);

	......

	/*
	 * Should we send STARTUP IPIs ?
	 *
	 * Determine this based on the APIC version.
	 * If we don't have an integrated APIC, don't send the STARTUP IPIs.
	 */
	if (APIC_INTEGRATED(apic_version[phys_apicid]))
		num_starts = 2;
	else
		num_starts = 0;

	......

	/*
	 * Run STARTUP IPI loop.
	 */
	for (j = 1; j <= num_starts; j++) {
        
		......

		/*
		 * STARTUP IPI
		 */

		/* Target chip */
		/* Boot on the stack */
		/* Kick the second */
		apic_icr_write(APIC_DM_STARTUP | (start_eip >> 12),
			       phys_apicid);

		/*
		 * Give the other CPU some time to accept the IPI.
		 */
		udelay(300);

		......
            
	}
    
	......

	return (send_status | accept_status);
}

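According to Intel SDM Vol. 3A, Section 9.4.4, MP Initialization Example, after the BIST (built-in self-test) and the BIOS AP initialization code, the APs halt and wait for an INIT IPI. To wake the halted APs, the BSP broadcasts an INIT-SIPI-SIPI IPI sequence to all of them; the SDM illustrates this with a two-column pseudocode figure.

In that figure, the left column shows the INIT-SIPI-SIPI sequence and its wait timeouts when the number of processors is unknown, and the right column shows the same when the number of processors is known. In both columns, three IPIs are sent.

In the actual code, two INIT IPIs are sent first. The code then checks whether the apic is integrated, and if so (meaning the on-chip apic is in use) it sends two SIPI messages. Two INIT IPIs are sent for compatibility with older processors. The first is the real INIT IPI that the Intel documentation requires. The second is actually an INIT Level De-assert IPI, which newer Pentium 4 and Intel Xeon processors do not support but which is still sent for the sake of earlier processors. The remaining two SIPIs match the documented example exactly.

Now consider the start_eip argument of the SIPI, which is shifted right by 12 bits before it is sent. start_eip travels in the vector field of the ICR, and that field is only 8 bits wide, so it cannot hold a longer address. To work around this, when the AP receives the SIPI it shifts the vector field left by 12 bits to obtain the entry address, which is why the sender shifts right by 12 bits first.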

For more on the APIC, see the article Overview of the APIC on x86_64, or go directly to Intel SDM Vol. 3A, Section 11.6.1, Interrupt Command Register (ICR).

3.3.10 start_ip

start_ip is the entry point the AP executes after it is woken. Its value is:

/* start_ip had better be page-aligned! */
	unsigned long start_ip = real_mode_header->trampoline_start;

real_mode_header is a variable of the structure type with the same name, which is defined in the header arch/x86/include/asm/realmode.h:

// file: arch/x86/include/asm/realmode.h
/* This must match data at realmode.S */
struct real_mode_header {
	u32	text_start;
	u32	ro_end;
	/* SMP trampoline */
	u32	trampoline_start;
	u32	trampoline_status;
	u32	trampoline_header;
#ifdef CONFIG_X86_64
	u32	trampoline_pgd;
#endif
	/* ACPI S3 wakeup */
#ifdef CONFIG_ACPI_SLEEP
	u32	wakeup_start;
	u32	wakeup_header;
#endif
	/* APM/BIOS reboot */
	u32	machine_real_restart_asm;
#ifdef CONFIG_X86_64
	u32	machine_real_restart_seg;
#endif
};

The variable itself is defined in arch/x86/realmode/rm/header.S, where the member trampoline_start is set to pa_trampoline_start.

// file: arch/x86/realmode/rm/header.S
GLOBAL(real_mode_header)
	.long	pa_text_start
	.long	pa_ro_end
	/* SMP trampoline */
	.long	pa_trampoline_start
	.long	pa_trampoline_status
	.long	pa_trampoline_header
#ifdef CONFIG_X86_64
	.long	pa_trampoline_pgd;
#endif
	/* ACPI S3 wakeup */
#ifdef CONFIG_ACPI_SLEEP
	.long	pa_wakeup_start
	.long	pa_wakeup_header
#endif
	/* APM/BIOS reboot */
	.long	pa_machine_real_restart_asm
#ifdef CONFIG_X86_64
	.long	__KERNEL32_CS
#endif
END(real_mode_header)

pa_trampoline_start is defined in the header arch/x86/realmode/rm/pasyms.h with the value trampoline_start.

// file: arch/x86/realmode/rm/pasyms.h
pa_trampoline_start = trampoline_start;

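After following the chain, we arrive at the actual value of start_ip: it is the entry address of a piece of assembly code. The AP starts executing from this address after it is woken, and at that point the AP is in real mode.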

3.3.11 trampoline_start

// file: arch/x86/realmode/rm/trampoline_64.S
ENTRY(trampoline_start)
	cli			# We should be safe anyway
	wbinvd

	LJMPW_RM(1f)
1:
	mov	%cs, %ax	# Code and data in the same place
	mov	%ax, %ds
	mov	%ax, %es
	mov	%ax, %ss

	movl	$0xA5A5A5A5, trampoline_status
	# write marker for master knows we're running

	# Setup stack
	movl	$rm_stack_end, %esp

	call	verify_cpu		# Verify the cpu supports long mode
	testl   %eax, %eax		# Check for return code
	jnz	no_longmode

	/*
	 * GDT tables in non default location kernel can be beyond 16MB and
	 * lgdt will not be able to load the address as in real mode default
	 * operand size is 16bit. Use lgdtl instead to force operand size
	 * to 32 bit.
	 */

	lidtl	tr_idt	# load idt with 0, 0
	lgdtl	tr_gdt	# load gdt with whatever is appropriate

	movw	$__KERNEL_DS, %dx	# Data segment descriptor

	# Enable protected mode
	movl	$X86_CR0_PE, %eax	# protected mode (PE) bit
	movl	%eax, %cr0		# into protected mode

	# flush prefetch and jump to startup_32
	ljmpl	$__KERNEL32_CS, $pa_startup_32
	
no_longmode:
	hlt
	jmp no_longmode

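The overall flow of trampoline_start is as follows.

The cli instruction clears the interrupt flag (IF) in the EFLAGS register; once it is clear, the processor ignores maskable external interrupts.

The wbinvd (Write Back and Invalidate Cache) instruction writes modified data in the processor caches back to main memory and invalidates the caches.

The LJMPW_RM(1f) macro is defined as follows: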

// file: 
/*
 * 16-bit ljmpw to the real_mode_seg
 *
 * This must be open-coded since gas will choke on using a
 * relocatable symbol for the segment portion.
 */
#define LJMPW_RM(to)	.byte 0xea ; .word (to), real_mode_seg

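This macro stands in for the far-jump instruction ljmpw, where the w suffix means the operand is word-sized (16 bits). According to the comment, gas chokes when a relocatable symbol is used for the segment part of ljmpw, so the instruction is written out as raw machine code. 0xea is the opcode of the far jump; of the two values that follow, to becomes the new instruction pointer and real_mode_seg becomes the new value of CS.

real_mode_seg is defined in the linker script arch/x86/realmode/rm/realmode.lds.S: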

// file: arch/x86/realmode/rm/realmode.lds.S
real_mode_seg = 0;

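The argument to is the label 1f. The f suffix means "forward", so 1f refers to the nearest label 1 after the current position.

After LJMPW_RM(1f) executes, control has far-jumped to label 1.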

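Next, the value of %cs is copied into the %ds, %es, and %ss segment registers, so that all four segment registers hold the same value.

Then the variable trampoline_status is set to 0xA5A5A5A5. trampoline_status is a global variable, defined as follows: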

// file: arch/x86/realmode/rm/trampoline_common.S
GLOBAL(trampoline_status)	.space	4

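Through several layers of indirection, the address of this variable ends up in the trampoline_status member of the structure variable real_mode_header. 0xA5A5A5A5 is a marker that shows whether the trampoline code has started running. We can see it used in do_boot_cpu: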

			if (*trampoline_status == 0xA5A5A5A5)
				/* trampoline started but...? */
				pr_err("CPU%d: Stuck ??\n", cpu);
			else
				/* trampoline code not run */
				pr_err("CPU%d: Not responding\n", cpu);

Then the value $rm_stack_end is moved into %esp as the top of the stack. rm_stack_end is defined as follows:

// file: arch/x86/realmode/rm/stack.S
	.data
GLOBAL(HEAP)
	.long	rm_heap
GLOBAL(heap_end)
	.long	rm_stack

	.bss
	.balign	16
GLOBAL(rm_heap)
	.space	2048
GLOBAL(rm_stack)
	.space	2048
GLOBAL(rm_stack_end)

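As shown, 2048 bytes are reserved for the stack.

Next, the code verifies that the cpu supports long mode (64-bit mode); if not, it jumps to the no_longmode label, which, as the code shows, simply halts the cpu.

Then the IDT (Interrupt Descriptor Table) and GDT (Global Descriptor Table) information is loaded into the corresponding registers for temporary use.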

// file: arch/x86/realmode/rm/trampoline_64.S
	# Duplicate the global descriptor table
	# so the kernel can live anywhere
	.balign	16
	.globl tr_gdt
tr_gdt:
	.short	tr_gdt_end - tr_gdt - 1	# gdt limit
	.long	pa_tr_gdt
	.short	0
	.quad	0x00cf9b000000ffff	# __KERNEL32_CS
	.quad	0x00af9b000000ffff	# __KERNEL_CS
	.quad	0x00cf93000000ffff	# __KERNEL_DS
tr_gdt_end:

// file: arch/x86/realmode/rm/trampoline_common.S
tr_idt: .fill 1, 6, 0

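Then the kernel data segment selector, __KERNEL_DS, is moved into the %dx register.

__KERNEL_DS is defined below. The kernel data segment has index 3 in the GDT. The index is multiplied by 8 because the low three bits of a segment selector serve other purposes (bits 0-1 hold the RPL and bit 2 is the table indicator); the index itself occupies bits 3 through 15, so it has to be shifted left by 3 bits: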

// file: arch/x86/include/asm/segment.h
#define __KERNEL_DS	(GDT_ENTRY_KERNEL_DS*8)
#define GDT_ENTRY_KERNEL_DS 3

Next, protected mode is turned on by modifying the CR0 register. The macro X86_CR0_PE is defined as follows:

// file: arch/x86/include/uapi/asm/processor-flags.h
#define X86_CR0_PE	0x00000001 /* Protection Enable */

On x86_64, bit 0 of control register CR0 (the PE bit) controls whether protected mode is enabled. The Intel SDM describes the flag as follows:

CR0.PE

Protection Enable (bit 0 of CR0): Enables protected mode when set; enables real-address mode when clear. This flag does not enable paging directly. It only enables segment-level protection. To enable paging, both the PE and PG flags must be set.

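With protected mode enabled, the far-jump instruction ljmpl transfers control to startup_32. The l suffix means the operand is a doubleword (32 bits). The operand $__KERNEL32_CS is the new value of CS, and $pa_startup_32 is the new value of EIP.

The macro __KERNEL32_CS is defined as follows: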

// file: arch/x86/include/asm/segment.h
#define __KERNEL32_CS   (GDT_ENTRY_KERNEL32_CS * 8)
#define GDT_ENTRY_KERNEL32_CS 1

pa_startup_32 is defined as follows:

// file: arch/x86/realmode/rm/pasyms.h
pa_startup_32 = startup_32;

An example of the ljmp instruction follows; for its documentation see Jump (jmp, ljmp):

Long jump, use 0xfebc for the CS register and 0x12345678 for the EIP register:
ljmp $0xfebc, $0x12345678

3.3.12 startup_32

// file: arch/x86/realmode/rm/trampoline_64.S
	.section ".text32","ax"
	.code32
	.balign 4
ENTRY(startup_32)

	......

	/*
	 * At this point we're in long mode but in 32bit compatibility mode
	 * with EFER.LME = 1, CS.L = 0, CS.D = 1 (and in turn
	 * EFER.LMA = 1). Now we want to jump in 64bit mode, to do that we use
	 * the new gdt/idt that has __KERNEL_CS with CS.L = 1.
	 */
	ljmpl	$__KERNEL_CS, $pa_startup_64

At the end of startup_32, an ljmpl instruction jumps to pa_startup_64. pa_startup_64 is defined below, so execution continues at startup_64:

// file: arch/x86/realmode/rm/pasyms.h
pa_startup_64 = startup_64;

3.3.13 startup_64

// file: arch/x86/realmode/rm/trampoline_64.S
ENTRY(startup_64)
	# Now jump into the kernel using virtual addresses
	jmpq	*tr_start(%rip)

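startup_64 consists of a single jump. It is an indirect jump: execution continues at the address stored in the variable tr_start.

tr_start is both a label in its own right and the start member of the structure variable trampoline_header.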

// file: arch/x86/realmode/rm/trampoline_64.S
GLOBAL(trampoline_header)
	tr_start:		.space	8
	GLOBAL(tr_efer)		.space	8
	GLOBAL(tr_cr4)		.space	4
END(trampoline_header)

// file: arch/x86/include/asm/realmode.h
/* This must match data at trampoline_32/64.S */
struct trampoline_header {
#ifdef CONFIG_X86_32
	u32 start;
	u16 gdt_pad;
	u16 gdt_limit;
	u32 gdt_base;
#else
	u64 start;
	u64 efer;
	u32 cr4;
#endif
};

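The structure variable trampoline_header is in turn referenced from another structure variable, real_mode_header, whose trampoline_header member records its address.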

// file: arch/x86/include/asm/realmode.h
/* This must match data at realmode.S */
struct real_mode_header {
    
	......
        
	/* SMP trampoline */
	u32	trampoline_start;
	u32	trampoline_status;
	u32	trampoline_header;
    
	......
        
};

// file: arch/x86/realmode/rm/header.S
GLOBAL(real_mode_header)

	......
	
	/* SMP trampoline */
	.long	pa_trampoline_start
	.long	pa_trampoline_status
	.long	pa_trampoline_header
	
	......
	
END(real_mode_header)

// file: arch/x86/realmode/rm/pasyms.h
pa_trampoline_header = trampoline_header;
pa_trampoline_pgd = trampoline_pgd;
pa_trampoline_start = trampoline_start;
pa_trampoline_status = trampoline_status;

Initialization of tr_start

As noted above, tr_start and trampoline_header->start are the same variable. trampoline_header->start is set to secondary_startup_64 in the setup_real_mode function, which is reached through start_kernel() -> setup_arch() -> setup_real_mode().

// file: arch/x86/realmode/init.c
void __init setup_real_mode(void)
{
	......
        
	struct trampoline_header *trampoline_header;
    
	......

	/* Must be perfomed *after* relocation. */
	trampoline_header = (struct trampoline_header *)
		__va(real_mode_header->trampoline_header);

	......

	trampoline_header->start = (u64) secondary_startup_64;
    
	......
}

3.3.14 secondary_startup_64

// file: arch/x86/kernel/head_64.S
ENTRY(secondary_startup_64)
    
	......

	/* Set up %gs.
	 *
	 * The base of %gs always points to the bottom of the irqstack
	 * union.  If the stack protector canary is enabled, it is
	 * located at %gs:40.  Note that, on SMP, the boot cpu uses
	 * init data section till per cpu areas are set up.
	 */
	movl	$MSR_GS_BASE,%ecx
	movl	initial_gs(%rip),%eax
	movl	initial_gs+4(%rip),%edx
	wrmsr	

	......
	
	/* Finally jump to run C code and to be on real kernel address
	 * Since we are running on identity-mapped space we have to jump
	 * to the full 64bit address, this is only possible as indirect
	 * jump.  In addition we need to ensure %cs is set so we make this
	 * a far return.
	 *
	 * Note: do not change to far jump indirect with 64bit offset.
	 *
	 * AMD does not support far jump indirect with 64bit offset.
	 * AMD64 Architecture Programmer's Manual, Volume 3: states only
	 *	JMP FAR mem16:16 FF /5 Far jump indirect,
	 *		with the target specified by a far pointer in memory.
	 *	JMP FAR mem16:32 FF /5 Far jump indirect,
	 *		with the target specified by a far pointer in memory.
	 *
	 * Intel64 does support 64bit offset.
	 * Software Developer Manual Vol 2: states:
	 *	FF /5 JMP m16:16 Jump far, absolute indirect,
	 *		address given in m16:16
	 *	FF /5 JMP m16:32 Jump far, absolute indirect,
	 *		address given in m16:32.
	 *	REX.W + FF /5 JMP m16:64 Jump far, absolute indirect,
	 *		address given in m16:64.
	 */
	movq	initial_code(%rip),%rax
	pushq	$0		# fake return address to stop unwinder
	pushq	$__KERNEL_CS	# set correct cs
	pushq	%rax		# target address in negative space
	lretq

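In secondary_startup_64, the wrmsr instruction sets the MSR_GS_BASE register to the value of the variable initial_gs. initial_gs was set in do_boot_cpu to per_cpu_offset(cpu), the base address of the per-cpu area of the cpu being started. The wrmsr instruction was covered in Section 2.2.2 and is not repeated here.

Finally, the lretq instruction switches the flow of execution. lretq is a far return; the q suffix means the operand is a quadword, that is 8 bytes (64 bits). lretq first pops the value on top of the stack (here the value of %rax, which is initial_code) into RIP, then pops the next value (here the macro __KERNEL_CS) into CS, and then resumes execution at the address in RIP. initial_code was set in do_boot_cpu to the address of the start_secondary function.

Ordinarily a far jump could be used to switch the flow of execution here, but, as the comment in the code explains, AMD processors do not support a far indirect jump with a 64-bit offset, so a far return is used instead.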

For details on far calls and far returns, see Intel SDM Vol. 1, Section 6.4.2, Far CALL and RET Operation.

initial_gs and initial_code

// file: arch/x86/kernel/smpboot.c
static int __cpuinit do_boot_cpu(int apicid, int cpu, struct task_struct *idle)
{
	......
        
    initial_gs = per_cpu_offset(cpu);
    
    ...
	
	initial_code = (unsigned long)start_secondary;
	
	...
}

3.3.15 start_secondary

At the end of secondary_startup_64, the lretq instruction transfers control to start_secondary. The first thing this function does is call cpu_init() to initialize the cpu.

// file: arch/x86/kernel/smpboot.c
/*
 * Activate a secondary processor.
 */
notrace static void __cpuinit start_secondary(void *unused)
{
	/*
	 * Don't put *anything* before cpu_init(), SMP booting is too
	 * fragile that we want to limit the things done here to the
	 * most necessary things.
	 */
	cpu_init();
    
	......
        
}

3.3.16 cpu_init

cpu_init calls the now-familiar switch_to_new_gdt and then initializes the MSR_KERNEL_GS_BASE register to 0.

// file: arch/x86/kernel/cpu/common.c
void __cpuinit cpu_init(void)
{
    
	......

	/*
	 * Initialize the per-CPU GDT with the boot GDT,
	 * and set up the GDT descriptor:
	 */

	switch_to_new_gdt(cpu);
    
	......
        
	wrmsrl(MSR_KERNEL_GS_BASE, 0);
    
	......

}

3.3.17 switch_to_new_gdt(cpu)

switch_to_new_gdt calls load_percpu_segment to reload the base address of the per-cpu area.

// file: arch/x86/kernel/cpu/common.c
/*
 * Current gdt points %fs at the "master" per-cpu area: after this,
 * it's on the real one.
 */
void switch_to_new_gdt(int cpu)
{
	......
        
	/* Reload the per-cpu base */
	load_percpu_segment(cpu);
}

3.3.18 load_percpu_segment

// file: arch/x86/kernel/cpu/common.c
void load_percpu_segment(int cpu)
{
#ifdef CONFIG_X86_32
	loadsegment(fs, __KERNEL_PERCPU);
#else
	loadsegment(gs, 0);
	wrmsrl(MSR_GS_BASE, (unsigned long)per_cpu(irq_stack_union.gs_base, cpu));
#endif
	load_stack_canary_segment();
}