Analyzing a Deadlock Case with QEMU-based kdump and crash


Among the stability problems encountered day to day, deadlock is fairly common. Four typical scenarios are:

  1. A process acquires the same lock twice, known as an AA deadlock. For example, acquiring the same spinlock twice, or, with a read-write lock, taking the read lock first and then requesting the write lock (see the sketch after this list).
  2. A process takes a spinlock without disabling hard interrupts. After the process acquires the spinlock, a hard interrupt preempts it and tries to acquire the same spinlock.
  3. Two processes both need locks L1 and L2. Process 1 holds L1 and then tries to take L2, while process 2 holds L2 and tries to take L1; the two processes deadlock. This is called an AB-BA deadlock.
  4. On one CPU, process 1 holds lock L1 and then tries to take L2; on another CPU, process 2 holds L2, and a hard interrupt preempts process 2 and tries to take L1. This kind of AB-BA deadlock is very subtle.
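
Scenario 1 is exactly the pattern analyzed later in this article: a read lock is taken first, then a write lock is requested on the same lock. A minimal user-space sketch with a POSIX rwlock illustrates it (on glibc the wrlock call below never returns, because the calling thread itself still holds the read lock):

#include <pthread.h>
#include <stdio.h>

int main(void)
{
	pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;

	pthread_rwlock_rdlock(&lock);      /* take the lock as a reader */
	puts("read lock held");

	/* AA deadlock: the writer must wait for all readers to leave,
	 * but this thread is itself one of those readers. */
	pthread_rwlock_wrlock(&lock);
	puts("never reached");
	return 0;
}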

A previous article, 基于qemu的kdump与crash环境搭建与实例解析, described how to set up a local environment for debugging with kdump and crash. This article records how that environment was used to analyze a hung task caused by an AA deadlock.

1. Preparing the Experiment Environment

The kdump and crash analysis environment was built with runninglinuxkernel_5.0, so this experiment continues in that environment.

1.1 Building the kernel module

//In QEMU, enter the directory containing the code
cd /mnt/rlk_lab/rlk_senior/Volume_2/chapter_7/lab06/
//Build the kernel module, producing mydev-mmap.ko
make
The module does the following (see the sketch after this list):
  1. During character-device driver initialization it creates the kernel threads lock_test1 and lock_test2.
  2. In the character device's open method it acquires the read-write semaphore mmap_sem as a reader and never releases it.
  3. In the MYDEV_CMD_GET_BUFSIZE branch of the character device's ioctl method it acquires the priv->lock mutex and never releases it.
  4. In the lock_test1 and lock_test2 kernel threads it acquires the priv->lock mutex.
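
The locking structure described above corresponds roughly to the following abridged sketch (function and structure names are illustrative, not the lab's exact source):

/* abridged sketch of mydev-mmap.ko's locking pattern (5.0-era APIs) */
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/mutex.h>
#include <linux/kthread.h>
#include <linux/rwsem.h>
#include <linux/sched.h>

struct mydemo_private_data {
	struct mutex lock;
};
static struct mydemo_private_data demo_priv;

static int demodrv_open(struct inode *inode, struct file *filp)
{
	/* item 2: take the caller's mmap_sem as a reader and never release it */
	down_read(&current->mm->mmap_sem);
	filp->private_data = &demo_priv;
	return 0;
}

static long demodrv_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
	struct mydemo_private_data *priv = filp->private_data;

	/* item 3: MYDEV_CMD_GET_BUFSIZE takes priv->lock and never releases it */
	mutex_lock(&priv->lock);
	return 0;
}

static int lockdep_thread1(void *data)
{
	/* item 4: lock_test1/lock_test2 both try to take priv->lock */
	mutex_lock(&demo_priv.lock);
	mutex_unlock(&demo_priv.lock);
	return 0;
}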

1.2 Building the test program

gcc test.c -o test
What test does (see the sketch after this list):
  1. It first calls open(DEMO_DEV_NAME, O_RDWR); in the driver's open method this acquires the read-write semaphore mmap_sem as a reader.
  2. ioctl(fd, MYDEV_CMD_GET_BUFSIZE, &len) reaches the MYDEV_CMD_GET_BUFSIZE branch of the driver's ioctl method, which acquires the priv->lock mutex.
  3. mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0) calls demodrv_mmap() in the driver.
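
Putting the three calls together, test.c looks roughly like the sketch below. DEMO_DEV_NAME and the MYDEV_CMD_GET_BUFSIZE value here are placeholders, not the lab's real definitions:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

#define DEMO_DEV_NAME         "/dev/my_demo_dev"   /* placeholder device node */
#define MYDEV_CMD_GET_BUFSIZE 1                     /* placeholder ioctl number */

int main(void)
{
	size_t len = 0;

	int fd = open(DEMO_DEV_NAME, O_RDWR);      /* driver open(): down_read(&mm->mmap_sem) */
	if (fd < 0)
		return 1;

	ioctl(fd, MYDEV_CMD_GET_BUFSIZE, &len);    /* driver ioctl(): mutex_lock(&priv->lock) */

	/* the driver's demodrv_mmap() is called here */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	(void)buf;

	pause();                                   /* keep the process alive */
	return 0;
}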

2. Running the Experiment

//Enable hung task panic: panic when a process stays in D state longer than the timeout
benshushu:lab06# echo 1 > /proc/sys/kernel/hung_task_panic
//Set how long a process may stay in D state before the panic is triggered
benshushu:lab06# echo 60 > /proc/sys/kernel/hung_task_timeout_secs
//Load the kernel module
benshushu:lab06# insmod mydev-mmap.ko
//Run the test process
benshushu:lab06# ./test &
//Check the process list
ps -aux

The result of the run:

(screenshot: hung task panic output showing the blocked tasks)

3. Analyzing the Results

The output above shows that three processes triggered the hung task detector. Copy the dump file and vmlinux into the same directory and analyze them with crash:

cd /mnt/crash_analyse
cp /var/crash/202212110744/dump.202212110744 ./
crash dump.202212110744 vmlinux

3.1 Analyzing the hung ps process

crash> ps | grep UN
    571      1   1  ffff800064bf4600  UN   0.0    2500   1528  dhcpcd
  19269      2   3  ffff800062578000  UN   0.0       0      0  [lock_test1]
  19270      2   3  ffff8000638c4600  UN   0.0       0      0  [lock_test2]
  19286  18281   2  ffff800062573800  UN   0.0    1760    996  test
  19309  18281   3  ffff800061c3f000  UN   0.1    7112   2636  ps

Start the analysis with the ps process:

//frame number  [stack address]  function name  at  instruction address within the function
crash> bt 19309
PID: 19309  TASK: ffff800061c3f000  CPU: 3   COMMAND: "ps"
 #0 [ffff800061e168d0] __switch_to at ffff0000100982dc
 #1 [ffff800061e16ac0] __schedule at ffff0000118c245c
 #2 [ffff800061e17150] schedule at ffff0000118c2868
 #3 [ffff800061e17210] rwsem_down_read_failed at ffff0000118d5948
 #4 [ffff800061e175c0] down_read at ffff0000118d1ec8
 #5 [ffff800061e17660] __access_remote_vm at ffff000010674ee8
 #6 [ffff800061e17760] access_remote_vm at ffff0000106752a8
 #7 [ffff800061e17790] get_mm_cmdline at ffff0000109aba70
 #8 [ffff800061e179a0] get_task_cmdline at ffff0000109abe74
 #9 [ffff800061e179e0] proc_pid_cmdline_read at ffff0000109abf3c
#10 [ffff800061e17a70] __vfs_read at ffff0000107eb7b4
#11 [ffff800061e17aa0] vfs_read at ffff0000107ebb1c
#12 [ffff800061e17b80] ksys_read at ffff0000107ec918
#13 [ffff800061e17c30] __se_sys_read at ffff0000107eca58
#14 [ffff800061e17c80] __arm64_sys_read at ffff0000107eca08
#15 [ffff800061e17ca0] __invoke_syscall at ffff0000100c151c
#16 [ffff800061e17cc0] invoke_syscall at ffff0000100c15c8
#17 [ffff800061e17d30] el0_svc_common at ffff0000100c16fc
#18 [ffff800061e17dd0] el0_svc_handler at ffff0000100c1bf8
#19 [ffff800061e17ff0] el0_svc at ffff000010086784
     PC: 0000ffffa16c44fc   LR: 0000ffffa1789c30   SP: 0000ffffcab94110
    X29: 0000ffffcab94110  X28: 0000000000000000  X27: 0000000000020000
    X26: 0000000000000000  X25: 0000000000000006  X24: 0000000000020000
    X23: 0000000000000020  X22: 0000ffffa1197010  X21: 0000ffffa1197010
    X20: 0000ffffa17a2000  X19: 0000000000000006  X18: 0000000000000000
    X17: 0000ffffa16c44d0  X16: 0000ffffa17a2e68  X15: 000000007fffffde
    X14: 0000000000000000  X13: 0000000000000000  X12: 0000000000000000
    X11: 0000ffffcab93e78  X10: 0000000000000001   X9: 0000000000000005
     X8: 000000000000003f   X7: 00000000fffffff9   X6: 0000ffffa16744f0
     X5: 5441555441010410   X4: 0000000040100401   X3: 0000ffffa176daf0
     X2: 0000000000020000   X1: 0000ffffa1197010   X0: 0000000000000006
    ORIG_X0: 0000000000000006  SYSCALLNO: 3f  PSTATE: 80000000

The call stack shows that ps is blocked in __access_remote_vm(), waiting for a read-write semaphore. Which semaphore is it waiting for, and who holds it?

int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
		unsigned long addr, void *buf, int len, unsigned int gup_flags)
{
	struct vm_area_struct *vma;
	int write = gup_flags & FOLL_WRITE;
	if (down_read_killable(&mm->mmap_sem))
		return 0;
        ......
	up_read(&mm->mmap_sem);
	return len;
}

We only need the address of mm->mmap_sem: from it we can locate the enclosing mm_struct, and from the mm_struct the owning task_struct, which tells us which process holds the lock. Disassemble __access_remote_vm() to find the address of mm->mmap_sem.

crash> dis __access_remote_vm
......
0xffff000010674ee0 <__access_remote_vm+80>:     ldr     x0, [sp,#48]
0xffff000010674ee4 <__access_remote_vm+84>:     add     x0, x0, #0x60
0xffff000010674ee8 <__access_remote_vm+88>:     bl      0xffff0000118d1e3c <down_read>

As shown above, the argument to down_read() is a struct rw_semaphore pointer, so x0 at 0xffff000010674ee8 holds the address of mm->mmap_sem. Since mmap_sem sits at offset 0x60 (decimal 96) inside mm_struct, the x0 loaded at 0xffff000010674ee0 (i.e. the value stored at sp+48) is the mm_struct pointer.

crash> struct -o  mm_struct
struct mm_struct {
        struct {
    [0]     struct vm_area_struct *mmap;
    [8]     struct rb_root mm_rb;
   [16]     u64 vmacache_seqnum;
   [24]     unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
   [32]     unsigned long mmap_base;
   [40]     unsigned long mmap_legacy_base;
   [48]     unsigned long task_size;
   [56]     unsigned long highest_vm_end;
   [64]     pgd_t *pgd;
   [72]     atomic_t mm_users;
   [76]     atomic_t mm_count;
   [80]     atomic_long_t pgtables_bytes;
   [88]     int map_count;
   [92]     spinlock_t page_table_lock;
   [96]     struct rw_semaphore mmap_sem;
......
}
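
In other words, recovering the mm_struct from the address of its mmap_sem member is just subtracting the member offset, which is what container_of() does in the kernel. A minimal stand-alone illustration with a toy structure (toy_mm and its fields are invented for this example):

#include <stddef.h>
#include <stdio.h>

struct toy_mm {
	long pad[12];   /* 96 bytes of preceding members, like offset 0x60 above */
	long sem;       /* stand-in for mmap_sem */
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

int main(void)
{
	struct toy_mm mm;
	long *sem_addr = &mm.sem;   /* what x0 holds right before the call to down_read */
	struct toy_mm *back = container_of(sem_addr, struct toy_mm, sem);

	printf("mm=%p recovered=%p offset=0x%zx\n",
	       (void *)&mm, (void *)back, offsetof(struct toy_mm, sem));
	return 0;
}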

Here sp is the stack pointer while __access_remote_vm() is executing. From frame #5 [ffff800061e17660] __access_remote_vm at ffff000010674ee8, sp is 0xffff800061e17660, so the stack slot holding the mm_struct pointer is at 0xffff800061e17660 + 0x30 (decimal 48) = 0xffff800061e17690. Use rd to read the value at that address.

//the stack slot holding the mm_struct pointer
crash> rd 0xffff800061e17690
ffff800061e17690:  ffff800062514b40                    @KQb....

So the mm_struct is at 0xffff800062514b40. mm_struct has an owner member that points to the owning task_struct.
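
For reference, the owner member exists only when the kernel is built with CONFIG_MEMCG; in the 5.0-era include/linux/mm_types.h it is declared roughly as follows:

struct mm_struct {
	......
#ifdef CONFIG_MEMCG
	/*
	 * "owner" points to a task that is regarded as the canonical
	 * user/owner of this mm_struct.
	 */
	struct task_struct __rcu *owner;
#endif
	......
};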

crash> struct mm_struct.owner ffff800062514b40
    owner = 0xffff800062573800
crash> struct task_struct.comm 0xffff800062573800
  comm = "test\000\000\000)\000\000\000\000\000\000\000"

So the test process holds this lock. Next, walk the wait_list member of the rw_semaphore to see which processes are waiting on it.
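
Each node on wait_list is a struct rwsem_waiter that lives on the waiter's kernel stack; that is why the list -s rwsem_waiter.task,type command below can print the waiting task and the wait type directly. The 5.0-era definition in kernel/locking/rwsem-xadd.c looks roughly like this:

enum rwsem_waiter_type {
	RWSEM_WAITING_FOR_WRITE,
	RWSEM_WAITING_FOR_READ
};

struct rwsem_waiter {
	struct list_head list;      /* linked into sem->wait_list */
	struct task_struct *task;   /* the sleeping task */
	enum rwsem_waiter_type type;
};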

//rw_semaphore = mm_struct + 0x60 = 0xffff800062514b40 + 0x60 = 0xffff800062514ba0
crash> struct rw_semaphore -x ffff800062514ba0
struct rw_semaphore {
  count = {
    //indicates one active reader plus a writer sleeping in the wait queue, or a writer holding the lock plus a reader waiting
    counter = 0xffffffff00000001
  }, 
  wait_list = {
    next = 0xffff800061c57aa0, 
    prev = 0xffff800061e17590
  }, 
  wait_lock = {
    raw_lock = {
      {
        val = {
          counter = 0x0
        }, 
        {
          locked = 0x0, 
          pending = 0x0
        }, 
        {
          locked_pending = 0x0, 
          tail = 0x0
        }
......
}
//see which processes are waiting
crash> list -s rwsem_waiter.task,type -h 0xffff800061c57aa0
ffff800061c57aa0
  task = 0xffff800062573800
  type = RWSEM_WAITING_FOR_WRITE
ffff800061e17590
  task = 0xffff800061c3f000
  type = RWSEM_WAITING_FOR_READ
ffff800062514ba8
  task = 0x0
  type = (RWSEM_WAITING_FOR_READ | unknown: 1649883138)

crash> struct task_struct.comm 0xffff800062573800
  comm = "test\000\000\000)\000\000\000\000\000\000\000"
crash> struct task_struct.comm 0xffff800061c3f000
  comm = "ps\000h\000\000\000)\000\000\000\000\000\000\000"

To summarize: test first acquired the mmap_sem rw_semaphore as a reader and never released it, then tried to acquire the same semaphore as a writer, while the ps process is queued behind it waiting for the read lock; hence the deadlock. (The third entry printed by the list command, 0xffff800062514ba8, is the wait_list head embedded in the rw_semaphore itself at offset 8, not a real waiter, which is why its task and type fields are garbage.)

//test's call stack while waiting for the lock: it tries to take the semaphore for write inside the brk system call
crash> bt 19286
PID: 19286  TASK: ffff800062573800  CPU: 2   COMMAND: "test"
 #0 [ffff800061c56c10] __switch_to at ffff0000100982dc
 #1 [ffff800061c56e00] __schedule at ffff0000118c245c
 #2 [ffff800061c57490] schedule at ffff0000118c2868
 #3 [ffff800061c57550] rwsem_down_write_failed_killable at ffff0000118d9404
 #4 [ffff800061c57ad0] down_write_killable at ffff0000118d2498
 #5 [ffff800061c57b70] __se_sys_brk at ffff00001067ef58
 #6 [ffff800061c57c80] __arm64_sys_brk at ffff00001067eeec
 #7 [ffff800061c57ca0] __invoke_syscall at ffff0000100c151c
 #8 [ffff800061c57cc0] invoke_syscall at ffff0000100c15c8
 #9 [ffff800061c57d30] el0_svc_common at ffff0000100c16fc
#10 [ffff800061c57dd0] el0_svc_handler at ffff0000100c1bf8
#11 [ffff800061c57ff0] el0_svc at ffff000010086784
     PC: 0000ffffab991004   LR: 0000ffffab9910bc   SP: 0000fffff21829b0
    X29: 0000fffff21829b0  X28: 00000000000002b0  X27: 0000ffffaba2f000
    X26: 0000ffffaba2f000  X25: 0000000000000fff  X24: 0000000000021000
    X23: 0000ffffaba2fb30  X22: 0000000000001000  X21: 0000ffffaba2e000
    X20: 0000000000000000  X19: 0000000000021000  X18: 0000000000000000
    X17: 0000ffffaba49230  X16: 0000ffffaba2f028  X15: 0000000000000000
    X14: 0000000000000000  X13: 0000000000000001  X12: 0000000000000000
    X11: 0000000000000038  X10: 0101010101010101   X9: 0000ffffab8cf218
     X8: 00000000000000d6   X7: 0000ffffaba2fb30   X6: fffffffffffff000
     X5: 0000000000000003   X4: 0000000000000004   X3: 0000000000000000
     X2: 0000000000000000   X1: 0000ffffab9432e0   X0: 0000000000000000
    ORIG_X0: 0000000000000000  SYSCALLNO: d6  PSTATE: 20000000
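
Why is test inside the brk system call at all? brk (typically reached via a libc heap allocation) must take mmap_sem for write before resizing the heap, so test blocks on the very semaphore it already holds as a reader. The 5.0-era entry in mm/mmap.c begins roughly as follows:

/* mm/mmap.c (5.0-era), abridged */
SYSCALL_DEFINE1(brk, unsigned long, brk)
{
	struct mm_struct *mm = current->mm;
	......
	if (down_write_killable(&mm->mmap_sem))
		return -EINTR;
	/* adjust the heap while holding the write lock */
	......
}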

3.2 Analyzing the hung lock_test1 process

The call stack of lock_test1:

crash> bt 19269
PID: 19269  TASK: ffff800062578000  CPU: 3   COMMAND: "lock_test1"
 #0 [ffff800061cee6f0] __switch_to at ffff0000100982dc
 #1 [ffff800061cee8e0] __schedule at ffff0000118c245c
 #2 [ffff800061ceef70] schedule at ffff0000118c2868
 #3 [ffff800061cef030] schedule_preempt_disabled at ffff0000118c2a28
 #4 [ffff800061cef040] __mutex_lock at ffff0000118c93a4
 #5 [ffff800061cefb70] __mutex_lock_slowpath at ffff0000118c6be0
 #6 [ffff800061cefb90] mutex_lock at ffff0000118c6e60
 #7 [ffff800061cefc60] do_work at ffff0000098f05bc [mydev_mmap]
 #8 [ffff800061cefd10] lockdep_thread1 at ffff0000098f06d8 [mydev_mmap]
 #9 [ffff800061cefd80] kthread at ffff000010189c04

The call stack shows that lock_test1 is blocked in mutex_lock(), trying to acquire a mutex.

void __sched mutex_lock(struct mutex *lock)
{
	might_sleep();

	if (!__mutex_trylock_fast(lock))
		__mutex_lock_slowpath(lock);
}
//disassemble mutex_lock
crash> dis mutex_lock
0xffff0000118c6bf0 <mutex_lock>:        mov     x9, x30
0xffff0000118c6bf4 <mutex_lock+4>:      nop
0xffff0000118c6bf8 <mutex_lock+8>:      stp     x29, x30, [sp,#-208]!
0xffff0000118c6bfc <mutex_lock+12>:     mov     x29, sp
0xffff0000118c6c00 <mutex_lock+16>:     str     x0, [sp,#24]
0xffff0000118c6c04 <mutex_lock+20>:     mrs     x0, sp_el0

The instruction str x0, [sp,#24] stores x0, which holds the struct mutex *lock argument, so [sp,#24] contains the lock address. From the bt output, the sp of the mutex_lock frame is 0xffff800061cefb90, so the lock address is *(0xffff800061cefb90 + 0x18 (decimal 24)) = *(0xffff800061cefba8) = 0xffff800066d8f110.

crash> rd ffff800061cefba8
ffff800061cefba8:  ffff800066d8f110                    ...f....
//dump the mutex structure
crash> struct -x mutex ffff800066d8f110
struct mutex {
  owner = {
    counter = 0xffff800062573801
  }, 
  wait_lock = {
    {
      rlock = {
        raw_lock = {
          {
            val = {
              counter = 0x0
            }, 
            {
              locked = 0x0, 
              pending = 0x0
            }, 
            {
              locked_pending = 0x0, 
              tail = 0x0
            }
          }
        }
      }
    }
  }, 
  osq = {
    tail = {
      counter = 0x0
    }
  }, 
  wait_list = {
    next = 0xffff800061d67b40, 
    prev = 0xffff800061cefb40
  }
}

The mutex address can also be computed from do_work():

//do_work() lives in mydev-mmap.ko, so the module symbols must be loaded first
crash> mod -s mydev_mmap /mnt/rlk_lab/rlk_senior/Volume_2/chapter_7/lab06/mydev-mmap.ko
     MODULE       NAME         SIZE  OBJECT FILE
ffff0000098f20c0  mydev_mmap  16384  /mnt/rlk_lab/rlk_senior/Volume_2/chapter_7/lab06/mydev-mmap.ko 

//disassemble do_work
crash> dis do_work
0xffff0000098f0578 <do_work>:   mov     x9, x30
0xffff0000098f057c <do_work+4>: nop
0xffff0000098f0580 <do_work+8>: stp     x29, x30, [sp,#-176]!
0xffff0000098f0584 <do_work+12>:        mov     x29, sp
0xffff0000098f0588 <do_work+16>:        str     x0, [sp,#24]
0xffff0000098f058c <do_work+20>:        mrs     x0, sp_el0
0xffff0000098f0590 <do_work+24>:        add     x0, x0, #0x4e8
0xffff0000098f0594 <do_work+28>:        ldr     x1, [x0]
0xffff0000098f0598 <do_work+32>:        str     x1, [sp,#168]
0xffff0000098f059c <do_work+36>:        mov     x1, #0x0                        // #0
0xffff0000098f05a0 <do_work+40>:        mrs     x0, sp_el0
0xffff0000098f05a4 <do_work+44>:        str     x0, [sp,#56]
0xffff0000098f05a8 <do_work+48>:        ldr     x0, [sp,#56]
0xffff0000098f05ac <do_work+52>:        bl      0xffff0000100ff324 <get_task_mm>
0xffff0000098f05b0 <do_work+56>:        str     x0, [sp,#48]
0xffff0000098f05b4 <do_work+60>:        ldr     x0, [sp,#24]
0xffff0000098f05b8 <do_work+64>:        ldr     x0, [x0]
0xffff0000098f05bc <do_work+68>:        bl      0xffff0000118c6bf0 <mutex_lock>

From the two instructions ldr x0, [sp,#24] and ldr x0, [x0] in the disassembly above, the lock argument passed to mutex_lock() is *(*(0xffff800061cefc60 + 0x18)), where 0xffff800061cefc60 is the sp of the do_work frame (#7).

crash> rd ffff800061cefc78
ffff800061cefc78:  ffff800061cefd50                    P..a....
crash> rd ffff800061cefd50
ffff800061cefd50:  ffff800066d8f110                    ...f....
crash> struct -x mutex.owner ffff800066d8f110
  owner = {
    counter = 0xffff800062573801
  }

The owner member serves two purposes:

  1. It records which task (struct task_struct *) currently holds the mutex; NULL means no task holds it.
  2. Because task_struct addresses are aligned to L1_CACHE_BYTES, the low-order bits of this member can be used to record state (on ARM64, L1_CACHE_BYTES is 64 bytes, so the low 6 bits are available for mutex state). The flags are:
  • MUTEX_FLAG_WAITERS: wait_list is non-empty, i.e. some task is blocked on this mutex, and the owner must perform a wakeup when it unlocks.

  • MUTEX_FLAG_HANDOFF: to keep tasks on the wait queue from starving (optimistically spinning tasks keep cutting in, so the woken top waiter does not necessarily get the lock), the top waiter sets this flag when it is woken up; the owner then hands the lock over to the top waiter on unlock instead of waking it up to compete for the lock.

  • MUTEX_FLAG_PICKUP: the handoff has been completed and the mutex is only waiting for the top waiter to come and take it.

So to find which task holds this mutex, mask off the flag bits from owner: task_struct address = 0xffff800062573801 & ~MUTEX_FLAGS = 0xffff800062573801 & ~0x07 = 0xffff800062573800.

#define MUTEX_FLAG_WAITERS	0x01
#define MUTEX_FLAG_HANDOFF	0x02
#define MUTEX_FLAG_PICKUP	0x04

#define MUTEX_FLAGS		0x07
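
This is exactly what the kernel's own helper does; in the 5.0-era kernel/locking/mutex.c it is essentially:

static inline struct task_struct *__mutex_owner(struct mutex *lock)
{
	/* strip the low MUTEX_FLAGS bits to recover the task_struct pointer */
	return (struct task_struct *)(atomic_long_read(&lock->owner) & ~MUTEX_FLAGS);
}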

crash> struct -x task_struct.comm 0xffff800062573800
  comm = "test\000\000\000)\000\000\000\000\000\000\000"

So the process currently holding this mutex is test.

In addition, the wait_list in the mutex records the processes waiting on it; walk the list to find out which tasks are blocked on this mutex:
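
As with the rw_semaphore, each node on the list is a struct mutex_waiter placed on the waiter's kernel stack, which is why list -s mutex_waiter.task can print the waiting task directly. The 5.0-era layout in include/linux/mutex.h is roughly:

struct mutex_waiter {
	struct list_head list;        /* linked into mutex->wait_list */
	struct task_struct *task;     /* the task sleeping on the mutex */
	struct ww_acquire_ctx *ww_ctx;
#ifdef CONFIG_DEBUG_MUTEXES
	void *magic;
#endif
};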

crash> list -s mutex_waiter.task -h 0xffff800061d67b40
ffff800061d67b40
  task = 0xffff8000638c4600
ffff800061cefb40
  task = 0xffff800062578000
ffff800066d8f120
  task = 0xffff0000098f1318
crash> struct -x task_struct.comm 0xffff8000638c4600
  comm = "lock_test2\000\000\000\000\000"
crash> struct -x task_struct.comm 0xffff800062578000
  comm = "lock_test1\000\000\000\000\000"
crash> struct -x task_struct.comm 0xffff0000098f1318
  comm = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"

As the output shows, lock_test1 and lock_test2 are both waiting on this mutex. (The last entry, 0xffff800066d8f120, is the wait_list head embedded in the mutex itself at offset 0x10, not a real waiter, which is why its comm comes out empty.)