ImperialViolet

Original link: www.imperialviolet.org

(This post uses x86-64 for illustration throughout. The fundamentals are similar for other platforms but will need some translation that I don't cover here.)

Despite compilers getting better over time, it's still the case that hand-written assembly can be worthwhile for certain hot-spots. Sometimes there are special CPU instructions for the thing that you're trying to do, sometimes you need detailed control of the resulting code and, to some extent, it remains possible for some people to out-optimise a compiler.

But hand-written assembly doesn't automatically get some of the things that the compiler generates for normal code, such as debugging information. Perhaps your assembly code never crashes (although any function that takes a pointer can suffer from bugs in other code) but you probably still care about accurate profiling information. In order for debuggers to walk up the stack in a core file, or for profilers to correctly account for CPU time, they need to be able to unwind call frames.

Unwinding used to be easy as every function would have a standard prologue:

push rbp
mov rbp, rsp

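This would make the stack look like this (remember that stacks grow downwards in memory):

  (higher addresses)
  Caller's stack
  ---------------------------- <- RSP value before CALL
  Saved RIP (pushed by CALL)
  ---------------------------- <- RSP at function entry
  Caller's RBP
  ---------------------------- <- RBP always points here
  Callee's local variables
  (lower addresses)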

So, upon entry to a function, the CALL instruction that jumped to the function in question will have pushed the previous program counter (from the RIP register) onto the stack. Then the function prologue saves the current value of RBP on the stack and copies the current value of the stack pointer into RBP. From this point until the function is complete, RBP won't be touched.

This makes stack unwinding easy because RBP always points to the call frame for the current function. That gets you the saved address of the parent call and the saved value of its RBP and so on.
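
As a rough illustration of that walk, here is a toy model in Python rather than real unwinder code. The memory dictionary, the addresses, the function name walk_frame_pointers and the convention that the chain ends at a zero RBP are all assumptions made for the sketch.

```python
def walk_frame_pointers(memory, rbp, max_frames=64):
    """Follow the RBP chain and collect the return address of each frame.

    memory is a hypothetical dict mapping addresses to 8-byte values.
    """
    return_addresses = []
    while rbp != 0 and len(return_addresses) < max_frames:
        saved_rbp = memory[rbp]       # caller's RBP, pushed by the prologue
        saved_rip = memory[rbp + 8]   # return address, pushed by CALL
        return_addresses.append(saved_rip)
        rbp = saved_rbp               # step up to the caller's frame
    return return_addresses


# Hypothetical stack for main -> f -> g, with g currently executing.
memory = {
    0x7000: 0x7020, 0x7008: 0x401234,  # g's frame: f's RBP, return into f
    0x7020: 0x7040, 0x7028: 0x401100,  # f's frame: main's RBP, return into main
    0x7040: 0x0,    0x7048: 0x400f00,  # main's frame: zero RBP ends the chain (assumed)
}

print([hex(a) for a in walk_frame_pointers(memory, 0x7000)])
# ['0x401234', '0x401100', '0x400f00']
```

The walk only works while every function on the stack keeps RBP as a frame pointer, which is exactly the assumption that breaks below.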

The problems with this scheme are that a) the function prologue can be excessive for small functions and b) we would like to be able to use RBP as a general purpose register to avoid spills. Which is why the GCC documentation says that “-O also turns on -fomit-frame-pointer on machines where doing so does not interfere with debugging”. This means that you can't depend on being able to unwind stacks like this. A process can be comprised of various shared libraries, any of which might be compiled with optimisations.

To be able to unwind the stack without depending on this convention, additional debugging tables are needed. The compiler will generate these automatically (when asked) for code that it generates, but it's something that we need to worry about when writing assembly functions ourselves if we want profilers and debuggers to work.

The reference for the assembly directives that we'll need is here, but they are very lightly documented. You can understand more by reading the DWARF spec, which documents the data that is being generated. Specifically see sections 6.4 and D.6. But I'll try to tie the two together in this post.

The tables that we need the assembler to emit for us are called Call Frame Information (CFI). (Not to be confused with Control Flow Integrity, which is very different.) Based on that name, all the assembler directives begin with .cfi_.

Next we need to define the Canonical Frame Address (CFA). This is the value of the stack pointer just before the CALL instruction in the parent function. In the diagram above, it's the value indicated by “RSP value before CALL”. Our first task will be to define data that allows the CFA to be calculated for any given instruction.

The CFI tables allow the CFA to be expressed as a register value plus an offset. For example, immediately upon function entry the CFA is RSP + 8. (The eight byte offset is because the CALL instruction will have pushed the previous RIP on the stack.)

As the function executes, however, the expression will probably change. If nothing else, after pushing a value onto the stack we would need to increase the offset.

So one design for the CFI table would be to store a (register, offset) pair for every instruction. Conceptually that's what we do but, to save space, only changes from instruction to instruction are stored.

It's time for an example, so here's a trivial assembly function that includes CFI directives and a running commentary.

  .globl  square
  .type   square,@function
  .hidden square
square:

This is a standard preamble for a function that's unrelated to CFI. Your assembly code should already be full of this.

  .cfi_startproc

Our first CFI directive. This is needed at the start of every annotated function. It causes a new CFI table for this function to be initialised.

  .cfi_def_cfa rsp, 8

This is defining the CFA expression as a register plus offset. One of the things that you'll see compilers do is express the registers as numbers rather than names. But, at least with GAS, you can write names. (I've included a table of DWARF register numbers and names below in case you need it.)

Getting back to the directive, this is just specifying what I discussed above: on entry to a function, the CFA is at RSP + 8.

  push    rbp
  .cfi_def_cfa rsp, 16

After pushing something to the stack, the value of RSP will have changed so we need to update the CFA expression. It's now RSP + 16, to account for the eight bytes we pushed.

  mov     rbp, rsp
  .cfi_def_cfa rbp, 16

This function happens to have a standard prologue, so we'll save the frame pointer in RBP, following the old convention. Thus, for the rest of the function we can define the CFA as RBP + 16 and manipulate the stack without having to worry about it again.

  mov     DWORD PTR [rbp-4], edi
  mov     eax, DWORD PTR [rbp-4]
  imul    eax, DWORD PTR [rbp-4]
  pop     rbp
  .cfi_def_cfa rsp, 8

We're getting ready to return from this function and, after restoring RBP from the stack, the old CFA expression is invalid because the value of RBP has changed. So we define it as RSP + 8 again.

  ret
  .cfi_endproc

At the end of the function we need to trigger the CFI table to be emitted. (It's an error if a CFI table is left open at the end of the file.)

The CFI tables for an object file can be dumped with objdump -W and, if you do that for the example above, you'll see two tables: something called a CIE and something called an FDE.

The CIE (Common Information Entry) table contains information common to all functions and it's worth taking a look at it:

… CIE
  Version:         1
  Augmentation:    "zR"
  Code alignment factor: 1
  Data alignment factor: -8
  Return address column: 16
  Augmentation data:     1b

  DW_CFA_def_cfa: r7 (rsp) ofs 8
  DW_CFA_offset: r16 (rip) at cfa-8

You can ignore everything until the DW_CFA_… lines at the end. They define CFI directives that are common to all functions (that reference this CIE). The first is saying that the CFA is at RSP + 8, which is what we had already defined at function entry. This means that you don't need a CFI directive at the beginning of the function. Basically RSP + 8 is already the default.

The second directive is something that we'll get to when we discuss saving registers.

If we look at the FDE (Frame Description Entry) for the example function that we defined, we see that it reflects the CFI directives from the assembly:

… FDE cie=…
  DW_CFA_advance_loc: 1 to 0000000000000001
  DW_CFA_def_cfa: r7 (rsp) ofs 16
  DW_CFA_advance_loc: 3 to 0000000000000004
  DW_CFA_def_cfa: r6 (rbp) ofs 16
  DW_CFA_advance_loc: 11 to 000000000000000f
  DW_CFA_def_cfa: r7 (rsp) ofs 8

The FDE describes the range of instructions that it's valid for and is a series of operations to either update the CFA expression, or to skip over the next n bytes of instructions. Fairly obvious.
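
To make the replay concrete, here is a small Python sketch that interprets the FDE dump above and reports which CFA rule is in force at a given byte offset into the function. The function name cfa_rule_at and the tuple encoding of the operations are hypothetical; the initial rule is the one from the CIE, and the advances are treated as byte counts because the code alignment factor is 1.

```python
def cfa_rule_at(pc_offset, initial_rule, operations):
    """Replay FDE operations and return the (register, offset) rule at pc_offset."""
    rule = initial_rule
    location = 0
    for op, *args in operations:
        if op == "advance_loc":
            location += args[0]
            if location > pc_offset:
                break                    # later rules apply only past pc_offset
        elif op == "def_cfa":
            rule = (args[0], args[1])    # new register and offset
        else:
            raise ValueError("unsupported operation: " + op)
    return rule


# Transcribed from the objdump output for square.
CIE_RULE = ("rsp", 8)
SQUARE_FDE = [
    ("advance_loc", 1), ("def_cfa", "rsp", 16),
    ("advance_loc", 3), ("def_cfa", "rbp", 16),
    ("advance_loc", 11), ("def_cfa", "rsp", 8),
]

for offset in (0x0, 0x1, 0x4, 0xe, 0xf):
    print(hex(offset), cfa_rule_at(offset, CIE_RULE, SQUARE_FDE))
# 0x0 ('rsp', 8)
# 0x1 ('rsp', 16)
# 0x4 ('rbp', 16)
# 0xe ('rbp', 16)
# 0xf ('rsp', 8)
```

Given the rule, the CFA itself is just the current value of the named register plus the offset.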

Optimisations for CFA directives

There are some shortcuts when writing CFA directives:

Firstly, you can update just the offset, or just the register, with .cfi_def_cfa_offset and .cfi_def_cfa_register respectively. This not only saves typing in the source file, it saves bytes in the table too.

Secondly, you can update the offset with a relative value using .cfi_adjust_cfa_offset. This is useful when pushing lots of values to the stack as the offset will increase by eight each time.

Here's the example from above, but using these directives and omitting the first directive that we don't need because of the CIE:

  .globl  square
  .type   square,@function
  .hidden square
square:
  .cfi_startproc
  push    rbp
  .cfi_adjust_cfa_offset 8
  mov     rbp, rsp
  .cfi_def_cfa_register rbp
  mov     DWORD PTR [rbp-4], edi
  mov     eax, DWORD PTR [rbp-4]
  imul    eax, DWORD PTR [rbp-4]
  pop     rbp
  .cfi_def_cfa rsp, 8
  ret
  .cfi_endproc
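
One way to convince yourself that the two versions of square describe the same thing is to model the directives as updates to a (register, offset) pair. The Python below is only such a model, with hypothetical names (apply_directive, trace); it does not touch the assembler and it assumes the CIE default of RSP + 8 as the starting rule.

```python
def apply_directive(rule, name, *args):
    """Return the CFA rule that results from applying one directive to rule."""
    register, offset = rule
    if name == "def_cfa":
        return (args[0], args[1])           # replace register and offset
    if name == "def_cfa_offset":
        return (register, args[0])          # replace the offset only
    if name == "def_cfa_register":
        return (args[0], offset)            # replace the register only
    if name == "adjust_cfa_offset":
        return (register, offset + args[0])  # offset relative to the old one
    raise ValueError("unknown directive: " + name)


def trace(directives, rule=("rsp", 8)):
    """Apply directives in order, recording the rule after each one."""
    rules = []
    for name, *args in directives:
        rule = apply_directive(rule, name, *args)
        rules.append(rule)
    return rules


long_form = [("def_cfa", "rsp", 8), ("def_cfa", "rsp", 16),
             ("def_cfa", "rbp", 16), ("def_cfa", "rsp", 8)]
short_form = [("adjust_cfa_offset", 8), ("def_cfa_register", "rbp"),
              ("def_cfa", "rsp", 8)]

# The first directive of the long form only restates the CIE default.
print(trace(long_form)[1:] == trace(short_form))  # True
```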

Saving registers

Consider a profiler that is unwinding the stack after a profiling signal. It calculates the CFA of the active function and, from that, finds the parent function. Now it needs to calculate the parent function's CFA and, from the CFI tables, discovers that it's related to RBX. Since RBX is a callee-saved register, that's reasonable, but the active function might have stomped RBX. So, in order for the unwinding to proceed it needs a way to find where the active function saved the old value of RBX. So there are more CFI directives that let you document where registers have been saved.

Registers can either be saved at an offset from the CFA (i.e. on the stack), or in another register. Most of the time they'll be saved on the stack though because, if you had a caller-saved register to spare, you would be using it first.

To indicate that a register is saved on the stack, use .cfi_offset. In the same example as above (see the stack diagram at the top) the caller's RBP is saved at CFA - 16 bytes. So, with saved registers annotated too, it would start like this:

square:
  .cfi_startproc
  push    rbp
  .cfi_adjust_cfa_offset 8
  .cfi_offset rbp, -16
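
To show what an unwinder can do with that information, here is a toy Python model of stepping from square's frame to its caller. The register values, the memory contents and the name unwind_one_frame are made up for the sketch; the two save locations are the CIE's rule for RIP (CFA - 8) and the .cfi_offset rbp, -16 annotation.

```python
def unwind_one_frame(registers, memory, cfa_rule, saved_at):
    """Recover the caller's registers from the callee's state.

    saved_at maps a register name to its offset from the CFA.
    """
    cfa_register, cfa_offset = cfa_rule
    cfa = registers[cfa_register] + cfa_offset
    caller = dict(registers)
    for name, offset in saved_at.items():
        caller[name] = memory[cfa + offset]  # reload the saved value
    caller["rsp"] = cfa  # the CFA is RSP as it was just before the CALL
    return caller


# Hypothetical state inside square, after "mov rbp, rsp".
registers = {"rip": 0x400123, "rsp": 0x7ff0, "rbp": 0x7ff0}
memory = {
    0x7ff8: 0x401234,  # saved RIP, pushed by CALL
    0x7ff0: 0x8040,    # caller's RBP, pushed by the prologue
}

caller = unwind_one_frame(registers, memory, ("rbp", 16),
                          {"rip": -8, "rbp": -16})
print({name: hex(value) for name, value in caller.items()})
# {'rip': '0x401234', 'rsp': '0x8000', 'rbp': '0x8040'}
```

Without the entry for RBP, the caller's RBP would be reported as 0x7ff0, which is square's own frame pointer rather than the caller's.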

If you need to save a register in another register for some reason, see the documentation for .cfi_register.

If you get all of that correct then your debugger should be able to unwind crashes correctly, and your profiler should be able to avoid recording lots of detached functions. However, I'm afraid that I don't know of a better way to test this than to zero RBP, add a crash in the assembly code, and check whether GDB can go up correctly.

(None of this works for Windows. But Per Vognsen, via Twitter, notes that there are similar directives in MASM.)

CFI expressions

New in version three of the DWARF standard are CFI Expressions. These define a stack machine for calculating the CFA value and can be useful when your stack frame is non-standard (which is fairly common in assembly code). However, there's no assembler support for them that I've been able to find, so one has to use .cfi_escape and provide the raw DWARF data in a .s file. As an example, see this kernel patch.

Since there's no assembler support, you'll need to read section 2.5 of the standard, then search for DW_CFA_def_cfa_expression and, perhaps, search for cfi_directive in OpenSSL's perlasm script for x86-64 and the places in OpenSSL where that is used. Good luck.
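
As a sketch of what producing that raw data can look like, the Python below builds a .cfi_escape line for a made-up frame in which the CFA is the value stored at RSP + 16, plus 8. The helper names are hypothetical and the opcode values (DW_CFA_def_cfa_expression = 0x0f, DW_OP_breg0 = 0x70, DW_OP_deref = 0x06, DW_OP_plus_uconst = 0x23) are my reading of the DWARF spec, so check them against the standard before relying on this.

```python
DW_CFA_DEF_CFA_EXPRESSION = 0x0f
DW_OP_DEREF = 0x06
DW_OP_PLUS_UCONST = 0x23
DW_OP_BREG0 = 0x70  # DW_OP_bregN is 0x70 + N


def uleb128(value):
    """Encode a non-negative integer as unsigned LEB128."""
    out = []
    while True:
        byte = value & 0x7f
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return out


def sleb128(value):
    """Encode a signed integer as signed LEB128."""
    out = []
    while True:
        byte = value & 0x7f
        value >>= 7  # arithmetic shift: negative values tend to -1
        if (value == 0 and not byte & 0x40) or (value == -1 and byte & 0x40):
            out.append(byte)
            return out
        out.append(byte | 0x80)


def cfa_expression_escape(expression):
    """Wrap DWARF expression bytes in DW_CFA_def_cfa_expression for .cfi_escape."""
    data = [DW_CFA_DEF_CFA_EXPRESSION] + uleb128(len(expression)) + expression
    return ".cfi_escape " + ", ".join("0x%02x" % b for b in data)


RSP = 7  # DWARF register number for RSP on x86-64
expression = ([DW_OP_BREG0 + RSP] + sleb128(16)   # push RSP + 16
              + [DW_OP_DEREF]                     # load the value stored there
              + [DW_OP_PLUS_UCONST] + uleb128(8))  # add 8
print(cfa_expression_escape(expression))
# .cfi_escape 0x0f, 0x05, 0x77, 0x10, 0x06, 0x23, 0x08
```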

(I suggest testing by adding some instructions that write to NULL in the assembly code and checking that gdb can correctly step up the stack and that info reg shows the correct values for callee-saved registers in the parent frame.)

CFI register numbers

In case you need to use or read the raw register numbers, here they are for a few architectures:

Register number  x86-64  x86                        ARM
0                RAX     EAX                        r0
1                RDX     ECX                        r1
2                RCX     EDX                        r2
3                RBX     EBX                        r3
4                RSI     ESP (may be EBP on MacOS)  r4
5                RDI     EBP (may be ESP on MacOS)  r5
6                RBP     ESI                        r6
7                RSP     EDI                        r7
8                R8                                 r8
9                R9                                 r9
10               R10                                r10
11               R11                                r11
12               R12                                r12
13               R13                                r13
14               R14                                r14
15               R15                                r15
16               RIP

(x86 values taken from page 25 of this doc. x86-64 values from page 57 of this doc. ARM taken from page 7 of this doc.)