Linux Kernel: Early Initialization of Interrupt and Exception Handlers (Continued)

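This article is based on Linux kernel v3.10 on the x86_64 architecture.

In Linux kernel: Early Initialization of Interrupt and Exception Handlers we covered the early initialization of exceptions, and in Linux Kernel: Early Handling of the Page-Fault Exception we covered early Page-Fault handling. Here we continue with the later stages of interrupt and exception initialization.

We have now reached the start_kernel function, where most of the kernel's initialization takes place, including that of exceptions and interrupts. We will follow the execution flow of start_kernel and walk through how exceptions and interrupts are initialized.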

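I. Overview of the Interrupt Stack and the Kernel Stack

1.1 Kernel stack

On x86_64, every thread has its own kernel stack of size THREAD_SIZE (2*PAGE_SIZE). As long as a thread is alive or is a zombie, its kernel stack holds useful information about it. While the thread runs in user space the kernel stack is empty, and only the thread_info structure at the lowest address of the stack region is in use; it stores per-thread information.

1.1.1 Kernel stack data structures

The kernel stack is described by the following data structure: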

// file: include/linux/sched.h
union thread_union {
	struct thread_info thread_info;
	unsigned long stack[THREAD_SIZE/sizeof(long)];
};

At the lowest address of the kernel stack region sits the thread_info structure, which holds per-thread information. Its task member points to the thread's task_struct:

// file: arch/x86/include/asm/thread_info.h
struct thread_info {
	struct task_struct	*task;		/* main task structure */
	...
    ...
};

Conversely, task_struct has a stack member that points back to the thread_info structure.

// file: include/linux/sched.h
struct task_struct {
    ...
    ...
        
    void *stack;
    
    ...
    ...
    
}

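Because every thread has its own task_struct, every thread's kernel stack is independent of the others.

The figure below shows the data structures related to the kernel stack:

The per-cpu variable kernel_stack holds the stack-base pointer of the current thread's kernel stack, that is, the high-address end where an empty stack starts.

1.1.2 Initializing the kernel stack

kernel_stack is initialized to the stack-base pointer of init_thread_union: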

// file: arch/x86/kernel/cpu/common.c
DEFINE_PER_CPU(unsigned long, kernel_stack) =
	(unsigned long)&init_thread_union - KERNEL_STACK_OFFSET + THREAD_SIZE;

init_thread_union is a variable of type union thread_union. THREAD_SIZE defines the stack size and expands to 8K (2*PAGE_SIZE):

// file: arch/x86/include/asm/page_64_types.h
#define THREAD_SIZE_ORDER	1
#define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)

KERNEL_STACK_OFFSET is the offset from the top of the stack region and expands to 40 (five 8-byte slots):

// file: arch/x86/include/asm/thread_info.h
#define KERNEL_STACK_OFFSET (5*8)

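After the calculation, kernel_stack is the highest address of init_thread_union (one past its last byte) minus 40, which serves as the stack-base pointer of the kernel stack.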

init_thread_union refers to init_task:

// file: init/init_task.c
union thread_union init_thread_union __init_task_data =
	{ INIT_THREAD_INFO(init_task) };

// file: arch/x86/include/asm/thread_info.h
#define INIT_THREAD_INFO(tsk)			\
{						\
	.task		= &tsk,			\
	...				
	...
}

init_task is a variable of type task_struct, and its stack member refers to init_thread_info:

// file: init/init_task.c
/* Initial task structure */
struct task_struct init_task = INIT_TASK(init_task);

// file: include/linux/init_task.h
/*
 *  INIT_TASK is used to set up the first task table, touch at
 * your own risk!. Base=0, limit=0x1fffff (=2MB)
 */
#define INIT_TASK(tsk)	\
{									\
	.state		= 0,						\
	.stack		= &init_thread_info,				\
	
	...
	...
       
}

init_thread_info is a member of init_thread_union, so the two share the same start address:

// file: arch/x86/include/asm/thread_info.h
#define init_thread_info	(init_thread_union.thread_info)

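After initialization, the kernel stack looks like the figure below:

The stack pointer starts 40 bytes below the maximum because, besides system calls, some exception handlers also run on the kernel stack, and when an exception occurs the CPU automatically pushes five registers onto the stack: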

// file: arch/x86/include/asm/calling.h
/* cpu exception frame or undefined in case of fast syscall: */
#define RIP		128
#define CS		136
#define EFLAGS		144
#define RSP		152
#define SS		160

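So the actual stack pointer starts 40 (5*8) bytes below the maximum address.

1.1.3 Switching the kernel stack

On every thread switch, kernel_stack has to be switched as well.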

// file: arch/x86/kernel/process_64.c
/*
 *	switch_to(x,y) should switch tasks from x to y.
 *
 * This could still be optimized:
 * - fold all the options into a flag word and test it with a single test.
 * - could test fs/gs bitsliced
 *
 * Kprobes not supported here. Set the probe on schedule instead.
 * Function graph tracer not supported too.
 */
__notrace_funcgraph struct task_struct *
__switch_to(struct task_struct *prev_p, struct task_struct *next_p)
{
    ...
    ...
        
   	this_cpu_write(kernel_stack,
		  (unsigned long)task_stack_page(next_p) +
		  THREAD_SIZE - KERNEL_STACK_OFFSET);
    
    ...
    ...
        
}

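Here prev_p points to the task_struct of the thread being switched out, and next_p to that of the thread being switched in.

1.2 Interrupt stack

The kernel stack is tied to a thread: each thread has one. The interrupt stack is tied to a CPU: each CPU has one. Its size is IRQ_STACK_SIZE (4*PAGE_SIZE), i.e. 16K. The interrupt stack is used for external hardware interrupts and for softirqs. When the first (that is, non-nested) external hardware interrupt arrives, the kernel switches from the current task's stack to the interrupt stack. Because the interrupt stack is per CPU, it does not increase the size of the per-thread stacks.

1.2.1 Interrupt stack data structures

Like the kernel stack, the interrupt stack is described by a union, irq_stack_union: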

// file: arch/x86/include/asm/processor.h
union irq_stack_union {
	char irq_stack[IRQ_STACK_SIZE];
	/*
	 * GCC hardcodes the stack canary as %gs:40.  Since the
	 * irq_stack is the object at %gs:0, we reserve the bottom
	 * 48 bytes of the irq stack for the canary.
	 */
	struct {
		char gs_base[40];
		unsigned long stack_canary;
	};
};

The macro IRQ_STACK_SIZE is defined as follows. It refers to PAGE_SIZE (4096) and IRQ_STACK_ORDER (2), so IRQ_STACK_SIZE expands to $2^{14}$ bytes, i.e. 16K.

// file: arch/x86/include/asm/page_64_types.h
#define IRQ_STACK_ORDER 2
#define IRQ_STACK_SIZE (PAGE_SIZE << IRQ_STACK_ORDER)

PAGE_SIZE is defined as:

// file: arch/x86/include/asm/page_types.h
/* PAGE_SHIFT determines the page size */
#define PAGE_SHIFT	12
#define PAGE_SIZE	(_AC(1,UL) << PAGE_SHIFT)

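As shown above, the interrupt stack is 16KB. Its lowest 40 bytes are the member gs_base, followed by stack_canary, an unsigned long (8 bytes on x86_64). In other words, the lowest 48 bytes of irq_stack are reserved and used to detect stack overflow.

The figure below shows the layout of the interrupt stack:

1.2.2 Initializing the interrupt stack

The interrupt stack irq_stack_union is a per-cpu variable of type union irq_stack_union, and its start address is aligned to PAGE_SIZE (4K).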

// file: arch/x86/kernel/cpu/common.c
DEFINE_PER_CPU_FIRST(union irq_stack_union,
		     irq_stack_union) __aligned(PAGE_SIZE);

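The DEFINE_PER_CPU_FIRST macro places the variable in the section .data..percpu..first, which is the first section of the per-cpu area and contains only irq_stack_union.

II. The "Canary" in the Interrupt Stack

The boot_init_stack_canary function sets up a canary for the interrupt stack in order to detect stack overflow. It is defined as follows: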

// file: arch/x86/include/asm/stackprotector.h
/*
 * Initialize the stackprotector canary value.
 *
 * NOTE: this must only be called from functions that never return,
 * and it must always be inlined.
 */
static __always_inline void boot_init_stack_canary(void)
{
	u64 canary;
	u64 tsc;

#ifdef CONFIG_X86_64
	BUILD_BUG_ON(offsetof(union irq_stack_union, stack_canary) != 40);
#endif
	/*
	 * We both use the random pool and the current TSC as a source
	 * of randomness. The TSC only matters for very early init,
	 * there it already has some randomness on most systems. Later
	 * on during the bootup the random pool has true entropy too.
	 */
	get_random_bytes(&canary, sizeof(canary));
	tsc = __native_read_tsc();
	canary += tsc + (tsc << 32UL);

	current->stack_canary = canary;
#ifdef CONFIG_X86_64
	this_cpu_write(irq_stack_union.stack_canary, canary);
#else
	this_cpu_write(stack_canary.canary, canary);
#endif
}

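On x86-64, the function first checks that the offset of stack_canary within irq_stack_union is exactly 40; otherwise the build fails.

As described earlier, union irq_stack_union is 16KB in size: its lowest 40 bytes are gs_base, followed by the 8-byte stack_canary, so the lowest 48 bytes of irq_stack are reserved for detecting stack overflow.

2.1 Computing a member offset: offsetof

The offsetof macro computes the offset of a member within a structure. It is defined as follows: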

// file: include/linux/stddef.h
#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)

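It casts the constant 0 to a pointer to the structure type and then takes the address of the member. Because the structure is treated as if it started at address 0, the member's address equals its offset within the structure.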

2.2 Compile-time checks: BUILD_BUG_ON

// file: include/linux/bug.h
#define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))

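BUILD_BUG_ON turns a failed check into a compile-time error by means of a small trick. !!(condition) is equivalent to condition != 0: it evaluates to 1 when condition is true and to 0 otherwise. So 2*!!(condition) is either 2 or 0, and 1 - 2*!!(condition) is either -1 or 1.

This leads to two possible outcomes:

If condition is true, the array size is -1 and compilation fails;
otherwise, there is no compile error.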

2.3 Generating the canary

Next, the canary value is generated from random bytes and the Time Stamp Counter (TSC):

get_random_bytes(&canary, sizeof(canary));
tsc = __native_read_tsc();
canary += tsc + (tsc << 32UL);

2.4 current is a macro

Once generated, the canary value is written to the stack_canary member of the current process's task_struct.

current->stack_canary = canary;

current is a macro that yields a pointer to the task_struct of the process currently running on this CPU.

// file: arch/x86/include/asm/current.h
#define current get_current()

static __always_inline struct task_struct *get_current(void)
{
	return this_cpu_read_stable(current_task);
}

current_task is a per-cpu variable, initialized to the address of init_task:

// file: arch/x86/kernel/cpu/common.c
DEFINE_PER_CPU(struct task_struct *, current_task) ____cacheline_aligned =
	&init_task;

On every process switch, the task_struct pointer of the incoming process is written to current_task:

/*
 *	switch_to(x,y) should switch tasks from x to y.
 *
 */
__notrace_funcgraph struct task_struct *
__switch_to(struct task_struct *prev_p, struct task_struct *next_p)
{
    ...
    ...
        
    this_cpu_write(current_task, next_p);
    
    ...
    ...
    
}

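In __switch_to, next_p is the task_struct pointer of the process that will run next. this_cpu_write(current_task, next_p); writes next_p into the current CPU's copy of the per-cpu variable current_task.

For background on per-cpu variables, see Linux Kernel Source Study: PER_CPU Variables, swapgs and Stack Switching (Part 1) and the Linux kernel documentation.

The stack_canary member of task_struct: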

struct task_struct {
    ...
    ...
        
#ifdef CONFIG_CC_STACKPROTECTOR
	/* Canary value for the -fstack-protector gcc feature */
	unsigned long stack_canary;
#endif
    
    ...
    ...
}

2.5 Writing the canary to the interrupt stack

The this_cpu_write macro is then used to write the canary value into irq_stack_union:

this_cpu_write(irq_stack_union.stack_canary, canary);

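The value ends up in the stack_canary member of the current CPU's copy of the per-cpu variable irq_stack_union, which is where GCC's hardcoded %gs:40 reference finds it.

III. Disabling and Enabling Local Interrupts

local_irq_disable and local_irq_enable disable and enable interrupts on the local CPU. How these two macros are implemented depends on the kernel configuration option CONFIG_TRACE_IRQFLAGS_SUPPORT: when it is enabled, they additionally call trace_hardirqs_off and trace_hardirqs_on, which trace the disabling and enabling of hardware interrupts. Tracing is outside the scope of this article, so we skip these two functions for now.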

// file: include/linux/irqflags.h
/*
 * The local_irq_*() APIs are equal to the raw_local_irq*()
 * if !TRACE_IRQFLAGS.
 */
#ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT
#define local_irq_enable() \
	do { trace_hardirqs_on(); raw_local_irq_enable(); } while (0)
#define local_irq_disable() \
	do { raw_local_irq_disable(); trace_hardirqs_off(); } while (0)
...
...
...

#else /* !CONFIG_TRACE_IRQFLAGS_SUPPORT */

#define local_irq_enable()	do { raw_local_irq_enable(); } while (0)
#define local_irq_disable()	do { raw_local_irq_disable(); } while (0)
    
...

#endif /* CONFIG_TRACE_IRQFLAGS_SUPPORT */

local_irq_disable expands to raw_local_irq_disable, which eventually calls native_irq_disable.

local_irq_enable expands to raw_local_irq_enable, which eventually calls native_irq_enable.

3.1 raw_local_irq_disable / raw_local_irq_enable

The raw_local_irq_disable and raw_local_irq_enable macros call arch_local_irq_disable and arch_local_irq_enable respectively.

// file: include/linux/irqflags.h
#define raw_local_irq_disable()		arch_local_irq_disable()
#define raw_local_irq_enable()		arch_local_irq_enable()

The call paths of arch_local_irq_disable and arch_local_irq_enable involve paravirtualization, which we do not cover here.

// file: arch/x86/include/asm/paravirt.h
static inline notrace void arch_local_irq_disable(void)
{
	PVOP_VCALLEE0(pv_irq_ops.irq_disable);
}

static inline notrace void arch_local_irq_enable(void)
{
	PVOP_VCALLEE0(pv_irq_ops.irq_enable);
}

// file: arch/x86/kernel/paravirt.c
struct pv_irq_ops pv_irq_ops = {
	...
        
	.irq_disable = __PV_IS_CALLEE_SAVE(native_irq_disable),
	.irq_enable = __PV_IS_CALLEE_SAVE(native_irq_enable),
    
	...
};

With the default pv_irq_ops shown above, local_irq_disable ends up in native_irq_disable and local_irq_enable ends up in native_irq_enable.

// file: arch/x86/include/asm/irqflags.h
static inline void native_irq_disable(void)
{
	asm volatile("cli": : :"memory");
}

static inline void native_irq_enable(void)
{
	asm volatile("sti": : :"memory");
}

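native_irq_disable is implemented with the cli instruction, which clears the IF flag in the RFLAGS register and thereby disables maskable external (hardware) interrupts.

native_irq_enable is implemented with the sti instruction, which sets the IF flag in RFLAGS and thereby enables maskable external (hardware) interrupts.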

IV. Initialization Functions in setup_arch

4.1 early_trap_init

early_trap_init installs the exception handlers related to debugging.

// file: arch/x86/kernel/traps.c
/* Set of traps needed for early debugging. */
void __init early_trap_init(void)
{
	set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
	/* int3 can be called from all */
	set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
#ifdef CONFIG_X86_32
	set_intr_gate(X86_TRAP_PF, &page_fault);
#endif
	load_idt(&idt_descr);
}

X86_TRAP_DB and X86_TRAP_BP stand for the Debug exception (vector 1, #DB) and the Breakpoint exception (vector 3, #BP) respectively.

// file: arch/x86/include/asm/traps.h
/* Interrupts/Exceptions */
enum {
	X86_TRAP_DE = 0,	/*  0, Divide-by-zero */
	X86_TRAP_DB,		/*  1, Debug */
	X86_TRAP_NMI,		/*  2, Non-maskable Interrupt */
	X86_TRAP_BP,		/*  3, Breakpoint */
	X86_TRAP_OF,		/*  4, Overflow */
	X86_TRAP_BR,		/*  5, Bound Range Exceeded */
	X86_TRAP_UD,		/*  6, Invalid Opcode */
	X86_TRAP_NM,		/*  7, Device Not Available */
	X86_TRAP_DF,		/*  8, Double Fault */
	X86_TRAP_OLD_MF,	/*  9, Coprocessor Segment Overrun */
	X86_TRAP_TS,		/* 10, Invalid TSS */
	X86_TRAP_NP,		/* 11, Segment Not Present */
	X86_TRAP_SS,		/* 12, Stack Segment Fault */
	X86_TRAP_GP,		/* 13, General Protection Fault */
	X86_TRAP_PF,		/* 14, Page Fault */
	X86_TRAP_SPURIOUS,	/* 15, Spurious Interrupt */
	X86_TRAP_MF,		/* 16, x87 Floating-Point Exception */
	X86_TRAP_AC,		/* 17, Alignment Check */
	X86_TRAP_MC,		/* 18, Machine Check */
	X86_TRAP_XF,		/* 19, SIMD Floating-Point Exception */
	X86_TRAP_IRET = 32,	/* 32, IRET Exception */
};

As we can see, early_trap_init uses three different functions to install handlers:

set_intr_gate_ist
set_system_intr_gate_ist
set_intr_gate

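The implementation of set_intr_gate was covered in detail in Linux kernel: Early Initialization of Interrupt and Exception Handlers. set_intr_gate_ist and set_system_intr_gate_ist are implemented in a similar way.

Let's look at set_intr_gate_ist first. The _ist suffix indicates that the handler uses the Interrupt Stack Table (IST).

The IST mechanism is new in x86_64 and is only used there, when interrupt or exception handling requires a stack switch. It is optional: whether it applies to a given vector is determined by the IST field of that vector's gate descriptor in the Interrupt Descriptor Table (IDT). The IST field is 3 bits wide, giving 8 possible values. A value of 0 means the IST mechanism is not used. The mechanism provides up to 7 IST pointers (1 to 7), each 64 bits wide, stored in the Task-State Segment (TSS). When a gate uses the IST, the corresponding IST pointer from the TSS is loaded into RSP on the stack switch.

Before diving into the code, let's go over a few basic concepts related to exception handling.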

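4.1.1 Interrupt-gate and trap-gate format

In 64-bit mode, interrupt gates and trap gates have the format shown in the figure below:

The x86 architecture supports four kinds of gates:

Call gates
Interrupt gates
Trap gates
Task gates

The kinds are distinguished by the TYPE field of the gate descriptor:

As can be seen, in 64-bit mode the Type value is 12 (binary 1100) for a call gate, 14 (binary 1110) for an interrupt gate, and 15 (binary 1111) for a trap gate. There are no task gates in 64-bit mode.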

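4.1.2 Task-state segment format

In 64-bit mode, the TSS has the following format:

4.1.3 Initializing the interrupt stack table and the TSS

Both set_intr_gate_ist and set_system_intr_gate_ist use the interrupt stack table. Since #DB and #BP both serve debugging, they share the same exception stack: the debug stack.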

	set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
	/* int3 can be called from all */
	set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);

The macro DEBUG_STACK expands to 4, meaning the debug stack has index 4 in the interrupt stack table. The indices are defined as follows:

// file: arch/x86/include/asm/page_64_types.h
#define STACKFAULT_STACK 1
#define DOUBLEFAULT_STACK 2
#define NMI_STACK 3
#define DEBUG_STACK 4
#define MCE_STACK 5
#define N_EXCEPTION_STACKS 5  /* hw limit: 7 */

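N_EXCEPTION_STACKS is the number of interrupt stack table entries actually in use, and expands to 5. As noted above, the table can hold at most 7 pointers, i.e. 7 stack addresses.

The sizes and backing storage of the stacks in the interrupt stack table are defined as follows: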

// file: arch/x86/kernel/cpu/common.c
/*
 * Special IST stacks which the CPU switches to when it calls
 * an IST-marked descriptor entry. Up to 7 stacks (hardware
 * limit), all of them are 4K, except the debug stack which
 * is 8K.
 */
static const unsigned int exception_stack_sizes[N_EXCEPTION_STACKS] = {
	  [0 ... N_EXCEPTION_STACKS - 1]	= EXCEPTION_STKSZ,
	  [DEBUG_STACK - 1]			= DEBUG_STKSZ
};

static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);

EXCEPTION_STKSZ expands to 4K and DEBUG_STKSZ to 8K:

// file: arch/x86/include/asm/page_64_types.h
#define EXCEPTION_STACK_ORDER 0
#define EXCEPTION_STKSZ (PAGE_SIZE << EXCEPTION_STACK_ORDER)	// EXCEPTION_STKSZ expands to 4K

#define DEBUG_STACK_ORDER (EXCEPTION_STACK_ORDER + 1)			// DEBUG_STACK_ORDER expands to 1
#define DEBUG_STKSZ (PAGE_SIZE << DEBUG_STACK_ORDER)			// DEBUG_STKSZ expands to 8K (PAGE_SIZE is 4K)

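To sum up, exception_stack_sizes gives the size of each exception stack in the interrupt stack table: 8K for the debug stack and 4K for all the others.

exception_stacks provides the memory for these stacks and so determines where each of them lives. It is a per-cpu variable, so every processor has its own copy.

The layout of exception_stacks is shown below:

The interrupt stack table and the TSS are initialized as follows: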

// file: arch/x86/kernel/cpu/common.c
void __cpuinit cpu_init(void)
{
	struct orig_ist *oist;
	struct task_struct *me;
	struct tss_struct *t;
	unsigned long v;
	int cpu;
	int i;
    
    ...
        
    cpu = stack_smp_processor_id();
	t = &per_cpu(init_tss, cpu);
	oist = &per_cpu(orig_ist, cpu);
    
    ...
        
	/*
	 * set up and load the per-CPU TSS
	 */
	if (!oist->ist[0]) {
		char *estacks = per_cpu(exception_stacks, cpu);

		for (v = 0; v < N_EXCEPTION_STACKS; v++) {
			estacks += exception_stack_sizes[v];
			oist->ist[v] = t->x86_tss.ist[v] =
					(unsigned long)estacks;
			if (v == DEBUG_STACK-1)
				per_cpu(debug_stack_addr, cpu) = (unsigned long)estacks;
		}
	}
    
    ...
}

Two structures in the code above deserve attention: struct orig_ist and struct tss_struct.

struct orig_ist is defined as follows:

// file: arch/x86/include/asm/processor.h
/*
 * Save the original ist values for checking stack pointers during debugging
 */
struct orig_ist {
	unsigned long		ist[7];
};

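struct orig_ist represents the interrupt stack table. Its ist member is an array of 7 entries, matching the maximum number of IST pointers.

struct tss_struct is defined as follows: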

// file: arch/x86/include/asm/processor.h
struct tss_struct {
	/*
	 * The hardware state:
	 */
	struct x86_hw_tss	x86_tss;

	...
    ...

} ____cacheline_aligned;

struct x86_hw_tss is defined as follows:

// file: arch/x86/include/asm/processor.h
struct x86_hw_tss {
	u32			reserved1;
	u64			sp0;
	u64			sp1;
	u64			sp2;
	u64			reserved2;
	u64			ist[7];
	u32			reserved3;
	u32			reserved4;
	u16			reserved5;
	u16			io_bitmap_base;

} __attribute__((packed)) ____cacheline_aligned;

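The members of x86_hw_tss correspond one to one with the fields of the hardware TSS.

Next, we walk through the initialization of the TSS and the interrupt stack table.

First, the current CPU number is obtained.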

cpu = stack_smp_processor_id();

stack_smp_processor_id is a macro, defined as follows:

// file: arch/x86/include/asm/smp.h
#define stack_smp_processor_id()					\
({								\
	struct thread_info *ti;						\
	__asm__("andq %%rsp,%0; ":"=r" (ti) : "0" (CURRENT_MASK));	\
	ti->cpu;							\
})

CURRENT_MASK is the mask derived from THREAD_SIZE (8K), defined as follows:

// file: arch/x86/include/asm/page_64_types.h
#define THREAD_SIZE_ORDER	1
#define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)	//  THREAD_SIZE expands to 8K; PAGE_SIZE is 4K (see above)
#define CURRENT_MASK (~(THREAD_SIZE - 1))

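stack_smp_processor_id works as follows:

The processor is running in kernel mode, on a kernel stack described by thread_union. thread_union is a union of the stack and the thread_info structure, so thread_info sits at the lowest address of the stack region. Its cpu member records the number of the current CPU: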

// file: arch/x86/include/asm/thread_info.h
struct thread_info {
	...
        
	__u32			cpu;		/* current CPU */
    
	...
};

Because the kernel stack is aligned to THREAD_SIZE, the assembly code

    __asm__("andq %%rsp,%0; ":"=r" (ti) : "0" (CURRENT_MASK));

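ANDs the value of %rsp with CURRENT_MASK, which yields the lowest address of the stack, i.e. the start address of thread_union and therefore of thread_info.

ti->cpu then reads the cpu member of thread_info, which is the current CPU number.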

With the CPU number in hand, let's look at the next two lines.

t = &per_cpu(init_tss, cpu);
oist = &per_cpu(orig_ist, cpu);

orig_ist and init_tss are both per-cpu variables, created with the DEFINE_PER_CPU* family of macros.

// file: arch/x86/kernel/cpu/common.c
DEFINE_PER_CPU(struct orig_ist, orig_ist);

// file: arch/x86/kernel/process.c
/*
 * per-CPU TSS segments. Threads are completely 'soft' on Linux,
 * no more per-task TSS's. The TSS size is kept cacheline-aligned
 * so they are allowed to end up in the .data..cacheline_aligned
 * section. Since TSS's are completely CPU-local, we want them
 * on exact cacheline boundaries, to eliminate cacheline ping-pong.
 */
DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, init_tss) = INIT_TSS;

The per_cpu macro takes two arguments, a variable name and a CPU number, and gives access to that CPU's copy of the per-cpu variable. It is defined as follows:

// file: include/asm-generic/percpu.h
#define per_cpu(var, cpu) \
	(*SHIFT_PERCPU_PTR(&(var), per_cpu_offset(cpu)))

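For the implementation of DEFINE_PER_CPU and per_cpu, see Linux Kernel Source Study: PER_CPU Variables, swapgs and Stack Switching (Part 1).

For more operations on per-cpu variables, see the kernel document this_cpu_ops.txt.

Note also that init_tss is initialized to INIT_TSS, which is defined as follows: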

// file: arch/x86/include/asm/processor.h
#define INIT_TSS  { \
	.x86_tss.sp0 = (unsigned long)&init_stack + sizeof(init_stack) \
}

INIT_TSS initializes sp0 in the TSS (the stack pointer for privilege level 0, i.e. the base of the kernel stack). init_stack is defined as follows:

// file: arch/x86/include/asm/thread_info.h
#define init_stack		(init_thread_union.stack)

init_thread_union is an instance of union thread_union.

// file: include/linux/sched.h
extern union thread_union init_thread_union;

union thread_union {
	struct thread_info thread_info;
	unsigned long stack[THREAD_SIZE/sizeof(long)];
};

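(unsigned long)&init_stack + sizeof(init_stack) evaluates to the address just past the highest byte of init_thread_union. Since the stack grows downward, that address is the stack base.

Next comes the actual initialization of the interrupt stack table and the TSS.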

	/*
	 * set up and load the per-CPU TSS
	 */
	if (!oist->ist[0]) {
		char *estacks = per_cpu(exception_stacks, cpu);

		for (v = 0; v < N_EXCEPTION_STACKS; v++) {
			estacks += exception_stack_sizes[v];
			oist->ist[v] = t->x86_tss.ist[v] =
					(unsigned long)estacks;
			if (v == DEBUG_STACK-1)
				per_cpu(debug_stack_addr, cpu) = (unsigned long)estacks;
		}
	}

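Per-cpu variables created with DEFINE_PER_CPU without an initializer start out zeroed. The code therefore begins by testing !oist->ist[0] to see whether the interrupt stack table has already been set up, and runs the initialization only if it has not.

As mentioned above, exception_stacks is a per-cpu variable that reserves space for all the exception stacks. The per_cpu macro gives access to the specified CPU's copy of it. The loop then advances estacks by exception_stack_sizes[v] on each iteration. Because the stacks grow downward, the resulting address is the top (highest address) of stack v, which is the initial stack pointer the CPU should load, and it is written to both oist and x86_tss. For the debug stack, that address is also stored in the per-cpu variable debug_stack_addr for later use. At this point the interrupt stack table and the TSS are fully populated.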

4.1.4 The set_intr_gate_ist function

set_intr_gate_ist is implemented as follows:

// file: arch/x86/include/asm/desc.h
static inline void set_intr_gate_ist(int n, void *addr, unsigned ist)
{
	BUG_ON((unsigned)n > 0xFF);
	_set_gate(n, GATE_INTERRUPT, addr, 0, ist, __KERNEL_CS);
}

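It first checks that the vector number does not exceed the largest valid vector, 255 (0xFF); if it does, BUG_ON fires and the kernel reports a bug. It then calls _set_gate to write a new gate descriptor into the interrupt descriptor table. The gate type is GATE_INTERRUPT, i.e. an interrupt gate, with value 0xE (decimal 14).

The gate type values are defined in arch/x86/include/asm/desc_defs.h: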

// file: arch/x86/include/asm/desc_defs.h
enum {
	GATE_INTERRUPT = 0xE,
	GATE_TRAP = 0xF,
	GATE_CALL = 0xC,
	GATE_TASK = 0x5,
};

_set_gate is implemented as follows:

// file: arch/x86/include/asm/desc.h
static inline void _set_gate(int gate, unsigned type, void *addr,
			     unsigned dpl, unsigned ist, unsigned seg)
{
	gate_desc s;

	pack_gate(&s, type, (unsigned long)addr, dpl, ist, seg);
	/*
	 * does not need to be atomic because it is only done once at
	 * setup time
	 */
	write_idt_entry(idt_table, gate, &s);
}

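We analyzed this function in detail in Linux kernel: Early Initialization of Interrupt and Exception Handlers, so we will not repeat that here.

4.1.5 The set_system_intr_gate_ist function

set_system_intr_gate_ist is defined as follows: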

// file: arch/x86/include/asm/desc.h
static inline void set_system_intr_gate_ist(int n, void *addr, unsigned ist)
{
	BUG_ON((unsigned)n > 0xFF);
	_set_gate(n, GATE_INTERRUPT, addr, 0x3, ist, __KERNEL_CS);
}

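Like set_intr_gate_ist, this function checks the vector number and then calls _set_gate. The difference is the fourth argument passed to _set_gate: 0x3 here versus 0 in set_intr_gate_ist. From the definition of _set_gate, the fourth parameter is dpl, the descriptor privilege level. So the interrupt gate for the breakpoint exception (#BP) has a DPL of 0x3, while the one for the debug exception (#DB) has a DPL of 0.

We covered privilege checks for interrupts and exceptions in x86-64: Privilege-Level Protection and Program Control Transfer.

The processor checks the DPL of an interrupt or trap gate only when the interrupt or exception is generated by an INT n, INT3 or INTO instruction. In that case the CPL must be less than or equal to the gate's DPL (CPL ≤ DPL).

Since the gate descriptor for #BP has a DPL of 0x3, the int3 instruction can be executed from user space.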

4.1.6 INT3 example

Let's write a program to try out the breakpoint exception.

// breakpoint.c
#include <stdio.h>

int main() {
    int i;
    for (i = 0; i < 10; i++){
        printf("i is: %d\n", i);
        __asm__("int3");
    }
}

Compile it:

gcc -o breakpoint breakpoint.c

Running it directly terminates the program with a signal; the output is:

$ ./breakpoint 
i is: 0
Trace/breakpoint trap (core dumped)

Under gdb, however, it behaves like a breakpoint:

$ gdb breakpoint 
...
...

(gdb) r
Starting program: /home/zhouz/csapp/c_test/breakpoint 
i is: 0

Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000555555554672 in main ()
(gdb) c
Continuing.
i is: 1

Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000555555554672 in main ()
(gdb) c
Continuing.
i is: 2

Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000555555554672 in main ()
(gdb) c
Continuing.
i is: 3

Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000555555554672 in main ()

...
...

4.2 early_trap_pf_init

early_trap_pf_init calls set_intr_gate to install a new interrupt gate that handles the Page-Fault exception, vector 14 (#PF).

// file: arch/x86/kernel/traps.c
void __init early_trap_pf_init(void)
{
#ifdef CONFIG_X86_64
	set_intr_gate(X86_TRAP_PF, &page_fault);
#endif
}

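set_intr_gate has already been covered in another article, so we will not repeat it here.

V. References

1. Intel developer manual: Intel 64 and IA-32 Architectures Software Developer Manuals Volume 3A, Chapter 4 Paging

2. Linux kernel: Early Initialization of Interrupt and Exception Handlers

3. Linux Kernel Source Study: PER_CPU Variables, swapgs and Stack Switching (Part 1)

4. Kernel documentation: this_cpu_ops.txt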
