Deadlocks are among the more common stability problems encountered in day-to-day work. Four typical patterns:
- A task acquires the same lock twice, called an AA deadlock. For example, acquiring the same spinlock twice, or, with a reader/writer lock, taking the read lock first and then requesting the write lock.
- A task acquires a spinlock without disabling hard interrupts; after it takes the lock, a hard interrupt preempts it and tries to acquire the same spinlock.
- Two tasks both need locks L1 and L2. Task 1 holds L1 and requests L2, while task 2 holds L2 and requests L1; the two tasks deadlock. This is called an AB-BA deadlock.
- On one CPU, task 1 holds lock L1 and requests L2; on another CPU, task 2 holds lock L2, and a hard interrupt preempts task 2 and requests L1. This flavor of AB-BA deadlock is particularly hard to spot.
The earlier post 基于qemu的kdump与crash环境搭建与实例解析 described how to build a local environment for debugging with kdump and crash. This post records how to use that environment to analyze a hungtask caused by an AA deadlock.
1. Preparing the Experiment Environment
The kdump and crash environment was built on runninglinuxkernel_5.0, so this experiment runs in the same environment.
1.1 Building the kernel module
//inside qemu, enter the source directory
cd /mnt/rlk_lab/rlk_senior/Volume_2/chapter_7/lab06/
//build the kernel module, producing mydev-mmap.ko
make
- At character-device-driver init time, the module creates two kernel threads, lock_test1 and lock_test2.
- In the driver's open method, it takes the mmap_sem read/write semaphore as a reader and never releases it.
- In the driver's MYDEV_CMD_GET_BUFSIZE ioctl method, it takes the priv->lock mutex and never releases it.
- The lock_test1 and lock_test2 kernel threads each try to take the priv->lock mutex.
1.2 Building test
gcc test.c -o test
What test does:
- open(DEMO_DEV_NAME, O_RDWR): the driver's open method takes the mmap_sem read/write semaphore as a reader.
- ioctl(fd, MYDEV_CMD_GET_BUFSIZE, &len): the driver's MYDEV_CMD_GET_BUFSIZE ioctl method takes the priv->lock mutex.
- mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0): calls the driver's demodrv_mmap().
2. Running the Experiment
//enable hung task panic: panic when a process stays in D state longer than the timeout
benshushu:lab06# echo 1 > /proc/sys/kernel/hung_task_panic
//set how long a process may stay in D state before the panic
benshushu:lab06# echo 60 > /proc/sys/kernel/hung_task_timeout_secs
//load the kernel module
benshushu:lab06# insmod mydev-mmap.ko
//run the test program
benshushu:lab06# ./test &
//check process state
ps -aux
Output:
3. Analyzing the Result
The output above shows that several processes ended up as hung tasks. Copy the dump file and vmlinux into the same directory and analyze them with crash:
cd /mnt/crash_analyse
cp /var/crash/202212110744/dump.202212110744 ./
crash dump.202212110744 vmlinux
3.1 Analyzing the hung ps process
crash> ps | grep UN
571 1 1 ffff800064bf4600 UN 0.0 2500 1528 dhcpcd
19269 2 3 ffff800062578000 UN 0.0 0 0 [lock_test1]
19270 2 3 ffff8000638c4600 UN 0.0 0 0 [lock_test2]
19286 18281 2 ffff800062573800 UN 0.0 1760 996 test
19309 18281 3 ffff800061c3f000 UN 0.1 7112 2636 ps
Start the analysis with the ps process:
//frame number  [frame's stack address]  function  at  instruction address recorded for that frame
crash> bt 19309
PID: 19309 TASK: ffff800061c3f000 CPU: 3 COMMAND: "ps"
#0 [ffff800061e168d0] __switch_to at ffff0000100982dc
#1 [ffff800061e16ac0] __schedule at ffff0000118c245c
#2 [ffff800061e17150] schedule at ffff0000118c2868
#3 [ffff800061e17210] rwsem_down_read_failed at ffff0000118d5948
#4 [ffff800061e175c0] down_read at ffff0000118d1ec8
#5 [ffff800061e17660] __access_remote_vm at ffff000010674ee8
#6 [ffff800061e17760] access_remote_vm at ffff0000106752a8
#7 [ffff800061e17790] get_mm_cmdline at ffff0000109aba70
#8 [ffff800061e179a0] get_task_cmdline at ffff0000109abe74
#9 [ffff800061e179e0] proc_pid_cmdline_read at ffff0000109abf3c
#10 [ffff800061e17a70] __vfs_read at ffff0000107eb7b4
#11 [ffff800061e17aa0] vfs_read at ffff0000107ebb1c
#12 [ffff800061e17b80] ksys_read at ffff0000107ec918
#13 [ffff800061e17c30] __se_sys_read at ffff0000107eca58
#14 [ffff800061e17c80] __arm64_sys_read at ffff0000107eca08
#15 [ffff800061e17ca0] __invoke_syscall at ffff0000100c151c
#16 [ffff800061e17cc0] invoke_syscall at ffff0000100c15c8
#17 [ffff800061e17d30] el0_svc_common at ffff0000100c16fc
#18 [ffff800061e17dd0] el0_svc_handler at ffff0000100c1bf8
#19 [ffff800061e17ff0] el0_svc at ffff000010086784
PC: 0000ffffa16c44fc LR: 0000ffffa1789c30 SP: 0000ffffcab94110
X29: 0000ffffcab94110 X28: 0000000000000000 X27: 0000000000020000
X26: 0000000000000000 X25: 0000000000000006 X24: 0000000000020000
X23: 0000000000000020 X22: 0000ffffa1197010 X21: 0000ffffa1197010
X20: 0000ffffa17a2000 X19: 0000000000000006 X18: 0000000000000000
X17: 0000ffffa16c44d0 X16: 0000ffffa17a2e68 X15: 000000007fffffde
X14: 0000000000000000 X13: 0000000000000000 X12: 0000000000000000
X11: 0000ffffcab93e78 X10: 0000000000000001 X9: 0000000000000005
X8: 000000000000003f X7: 00000000fffffff9 X6: 0000ffffa16744f0
X5: 5441555441010410 X4: 0000000040100401 X3: 0000ffffa176daf0
X2: 0000000000020000 X1: 0000ffffa1197010 X0: 0000000000000006
ORIG_X0: 0000000000000006 SYSCALLNO: 3f PSTATE: 80000000
The call stack shows that the ps process is waiting for a read/write semaphore in __access_remote_vm(). Which lock is it waiting for, and who holds it?
int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
        unsigned long addr, void *buf, int len, unsigned int gup_flags)
{
    struct vm_area_struct *vma;
    int write = gup_flags & FOLL_WRITE;

    if (down_read_killable(&mm->mmap_sem))
        return 0;
    ......
    up_read(&mm->mmap_sem);
    return len;
}
If we can find the address of mm->mmap_sem, we can recover the containing mm_struct from it, and from the mm_struct find the owning task_struct, which tells us which process holds the lock. Disassemble __access_remote_vm() to locate mm->mmap_sem.
crash> dis __access_remote_vm
......
0xffff000010674ee0 <__access_remote_vm+80>: ldr x0, [sp,#48]
0xffff000010674ee4 <__access_remote_vm+84>: add x0, x0, #0x60
0xffff000010674ee8 <__access_remote_vm+88>: bl 0xffff0000118d1e3c <down_read>
down_read() takes a pointer to a rw_semaphore, so x0 at 0xffff000010674ee8 is the address of mm->mmap_sem. Since mmap_sem sits at offset 0x60 (decimal 96) inside mm_struct, the x0 loaded at 0xffff000010674ee0 (i.e. the value at sp+48) is the mm_struct pointer.
crash> struct -o mm_struct
struct mm_struct {
struct {
[0] struct vm_area_struct *mmap;
[8] struct rb_root mm_rb;
[16] u64 vmacache_seqnum;
[24] unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
[32] unsigned long mmap_base;
[40] unsigned long mmap_legacy_base;
[48] unsigned long task_size;
[56] unsigned long highest_vm_end;
[64] pgd_t *pgd;
[72] atomic_t mm_users;
[76] atomic_t mm_count;
[80] atomic_long_t pgtables_bytes;
[88] int map_count;
[92] spinlock_t page_table_lock;
[96] struct rw_semaphore mmap_sem;
......
}
Here sp is the stack pointer of the __access_remote_vm() frame. From #5 [ffff800061e17660] __access_remote_vm at ffff000010674ee8 we know sp is 0xffff800061e17660, so the stack slot holding the mm_struct pointer is 0xffff800061e17660 + 0x30 (decimal 48) = 0xffff800061e17690. Read it with rd:
//stack slot holding the mm_struct pointer
crash> rd 0xffff800061e17690
ffff800061e17690: ffff800062514b40 @KQb....
So the mm_struct is at 0xffff800062514b40. mm_struct has an owner member that points to the task_struct owning this mm.
crash> struct mm_struct.owner ffff800062514b40
owner = 0xffff800062573800
crash> struct task_struct.comm 0xffff800062573800
comm = "test\000\000\000)\000\000\000\000\000\000\000"
So the test process holds this lock. We can also walk the rw_semaphore's wait_list to see which tasks are waiting on it.
//rw_semaphore address = 0xffff800062514b40 + 0x60 = 0xffff800062514ba0
crash> struct rw_semaphore -x ffff800062514ba0
struct rw_semaphore {
count = {
//one active reader with a writer sleeping, or a writer holding the lock with a reader waiting
counter = 0xffffffff00000001
},
wait_list = {
next = 0xffff800061c57aa0,
prev = 0xffff800061e17590
},
wait_lock = {
raw_lock = {
{
val = {
counter = 0x0
},
{
locked = 0x0,
pending = 0x0
},
{
locked_pending = 0x0,
tail = 0x0
}
......
}
//list the tasks waiting on the semaphore
crash> list -s rwsem_waiter.task,type -h 0xffff800061c57aa0
ffff800061c57aa0
task = 0xffff800062573800
type = RWSEM_WAITING_FOR_WRITE
ffff800061e17590
task = 0xffff800061c3f000
type = RWSEM_WAITING_FOR_READ
ffff800062514ba8
task = 0x0
type = (RWSEM_WAITING_FOR_READ | unknown: 1649883138)
crash> struct task_struct.comm 0xffff800062573800
comm = "test\000\000\000)\000\000\000\000\000\000\000"
crash> struct task_struct.comm 0xffff800061c3f000
comm = "ps\000h\000\000\000)\000\000\000\000\000\000\000"
Putting it together: test first took the mmap_sem rw_semaphore as a reader and never released it, then tried to take the same semaphore again as a writer; ps then queued for a read lock behind that writer, so both hang. (The third entry in the list output, 0xffff800062514ba8, is the wait_list head embedded in the rw_semaphore itself, not a real waiter.) This is an AA deadlock on mmap_sem.
//test's call stack while waiting: it requests the writer lock from the brk system call
crash> bt 19286
PID: 19286 TASK: ffff800062573800 CPU: 2 COMMAND: "test"
#0 [ffff800061c56c10] __switch_to at ffff0000100982dc
#1 [ffff800061c56e00] __schedule at ffff0000118c245c
#2 [ffff800061c57490] schedule at ffff0000118c2868
#3 [ffff800061c57550] rwsem_down_write_failed_killable at ffff0000118d9404
#4 [ffff800061c57ad0] down_write_killable at ffff0000118d2498
#5 [ffff800061c57b70] __se_sys_brk at ffff00001067ef58
#6 [ffff800061c57c80] __arm64_sys_brk at ffff00001067eeec
#7 [ffff800061c57ca0] __invoke_syscall at ffff0000100c151c
#8 [ffff800061c57cc0] invoke_syscall at ffff0000100c15c8
#9 [ffff800061c57d30] el0_svc_common at ffff0000100c16fc
#10 [ffff800061c57dd0] el0_svc_handler at ffff0000100c1bf8
#11 [ffff800061c57ff0] el0_svc at ffff000010086784
PC: 0000ffffab991004 LR: 0000ffffab9910bc SP: 0000fffff21829b0
X29: 0000fffff21829b0 X28: 00000000000002b0 X27: 0000ffffaba2f000
X26: 0000ffffaba2f000 X25: 0000000000000fff X24: 0000000000021000
X23: 0000ffffaba2fb30 X22: 0000000000001000 X21: 0000ffffaba2e000
X20: 0000000000000000 X19: 0000000000021000 X18: 0000000000000000
X17: 0000ffffaba49230 X16: 0000ffffaba2f028 X15: 0000000000000000
X14: 0000000000000000 X13: 0000000000000001 X12: 0000000000000000
X11: 0000000000000038 X10: 0101010101010101 X9: 0000ffffab8cf218
X8: 00000000000000d6 X7: 0000ffffaba2fb30 X6: fffffffffffff000
X5: 0000000000000003 X4: 0000000000000004 X3: 0000000000000000
X2: 0000000000000000 X1: 0000ffffab9432e0 X0: 0000000000000000
ORIG_X0: 0000000000000000 SYSCALLNO: d6 PSTATE: 20000000
3.2 Analyzing the hung lock_test1 process
lock_test1's call stack:
crash> bt 19269
PID: 19269 TASK: ffff800062578000 CPU: 3 COMMAND: "lock_test1"
#0 [ffff800061cee6f0] __switch_to at ffff0000100982dc
#1 [ffff800061cee8e0] __schedule at ffff0000118c245c
#2 [ffff800061ceef70] schedule at ffff0000118c2868
#3 [ffff800061cef030] schedule_preempt_disabled at ffff0000118c2a28
#4 [ffff800061cef040] __mutex_lock at ffff0000118c93a4
#5 [ffff800061cefb70] __mutex_lock_slowpath at ffff0000118c6be0
#6 [ffff800061cefb90] mutex_lock at ffff0000118c6e60
#7 [ffff800061cefc60] do_work at ffff0000098f05bc [mydev_mmap]
#8 [ffff800061cefd10] lockdep_thread1 at ffff0000098f06d8 [mydev_mmap]
#9 [ffff800061cefd80] kthread at ffff000010189c04
The call stack shows that lock_test1 is blocked in mutex_lock(), waiting for a mutex.
void __sched mutex_lock(struct mutex *lock)
{
    might_sleep();

    if (!__mutex_trylock_fast(lock))
        __mutex_lock_slowpath(lock);
}
//disassemble mutex_lock
crash> dis mutex_lock
0xffff0000118c6bf0 <mutex_lock>: mov x9, x30
0xffff0000118c6bf4 <mutex_lock+4>: nop
0xffff0000118c6bf8 <mutex_lock+8>: stp x29, x30, [sp,#-208]!
0xffff0000118c6bfc <mutex_lock+12>: mov x29, sp
0xffff0000118c6c00 <mutex_lock+16>: str x0, [sp,#24]
0xffff0000118c6c04 <mutex_lock+20>: mrs x0, sp_el0
The instruction str x0, [sp,#24] spills x0, the struct mutex *lock argument, so [sp,#24] holds the lock address. From bt, the sp of the mutex_lock frame is 0xffff800061cefb90, so the lock address is *(0xffff800061cefb90 + 0x18 (decimal 24)) = *(0xffff800061cefba8) = 0xffff800066d8f110.
crash> rd ffff800061cefba8
ffff800061cefba8: ffff800066d8f110 ...f....
//dump the mutex structure
crash> struct -x mutex ffff800066d8f110
struct mutex {
owner = {
counter = 0xffff800062573801
},
wait_lock = {
{
rlock = {
raw_lock = {
{
val = {
counter = 0x0
},
{
locked = 0x0,
pending = 0x0
},
{
locked_pending = 0x0,
tail = 0x0
}
}
}
}
}
},
osq = {
tail = {
counter = 0x0
}
},
wait_list = {
next = 0xffff800061d67b40,
prev = 0xffff800061cefb40
}
}
The mutex address can also be derived from do_work():
//do_work() lives in mydev-mmap.ko, so load the module's symbols first
crash> mod -s mydev_mmap /mnt/rlk_lab/rlk_senior/Volume_2/chapter_7/lab06/mydev-mmap.ko
MODULE NAME SIZE OBJECT FILE
ffff0000098f20c0 mydev_mmap 16384 /mnt/rlk_lab/rlk_senior/Volume_2/chapter_7/lab06/mydev-mmap.ko
//disassemble do_work
crash> dis do_work
0xffff0000098f0578 <do_work>: mov x9, x30
0xffff0000098f057c <do_work+4>: nop
0xffff0000098f0580 <do_work+8>: stp x29, x30, [sp,#-176]!
0xffff0000098f0584 <do_work+12>: mov x29, sp
0xffff0000098f0588 <do_work+16>: str x0, [sp,#24]
0xffff0000098f058c <do_work+20>: mrs x0, sp_el0
0xffff0000098f0590 <do_work+24>: add x0, x0, #0x4e8
0xffff0000098f0594 <do_work+28>: ldr x1, [x0]
0xffff0000098f0598 <do_work+32>: str x1, [sp,#168]
0xffff0000098f059c <do_work+36>: mov x1, #0x0 // #0
0xffff0000098f05a0 <do_work+40>: mrs x0, sp_el0
0xffff0000098f05a4 <do_work+44>: str x0, [sp,#56]
0xffff0000098f05a8 <do_work+48>: ldr x0, [sp,#56]
0xffff0000098f05ac <do_work+52>: bl 0xffff0000100ff324 <get_task_mm>
0xffff0000098f05b0 <do_work+56>: str x0, [sp,#48]
0xffff0000098f05b4 <do_work+60>: ldr x0, [sp,#24]
0xffff0000098f05b8 <do_work+64>: ldr x0, [x0]
0xffff0000098f05bc <do_work+68>: bl 0xffff0000118c6bf0 <mutex_lock>
From ldr x0, [sp,#24] followed by ldr x0, [x0], the lock argument passed to mutex_lock() is *(*(sp + 0x18)). With the do_work frame's sp of 0xffff800061cefc60, that is *(*(0xffff800061cefc60 + 0x18)) = *(*(0xffff800061cefc78)):
crash> rd ffff800061cefc78
ffff800061cefc78: ffff800061cefd50 P..a....
crash> rd ffff800061cefd50
ffff800061cefd50: ffff800066d8f110 ...f....
crash> struct -x mutex.owner ffff800066d8f110
owner = {
counter = 0xffff800062573801
}
The owner member serves two purposes:
- It records which task (struct task_struct *) holds the mutex; NULL means the mutex is not held by anyone.
- Because task_struct addresses are L1_CACHE_BYTES aligned (64 bytes on ARM64), the low bits of the pointer are always zero and can carry state. The mutex code uses the low 3 bits for the following flags:
  - MUTEX_FLAG_WAITERS: wait_list is non-empty, i.e. tasks are blocked on this mutex; the owner must perform a wakeup when it unlocks.
  - MUTEX_FLAG_HANDOFF: to keep queued waiters from starving, the top waiter sets this flag when woken (optimistically spinning tasks keep cutting in, so a woken top waiter may still lose the race). With this flag set, the owner hands the lock directly to the top waiter at unlock time instead of waking it up to compete again.
  - MUTEX_FLAG_PICKUP: the handoff is complete; the mutex is only waiting for the top waiter to pick it up.
So to get the task that holds this mutex, mask off the flag bits: task_struct address = 0xffff800062573801 & ~MUTEX_FLAGS = 0xffff800062573801 & ~0x07 = 0xffff800062573800
#define MUTEX_FLAG_WAITERS 0x01
#define MUTEX_FLAG_HANDOFF 0x02
#define MUTEX_FLAG_PICKUP 0x04
#define MUTEX_FLAGS 0x07
crash> struct -x task_struct.comm 0xffff800062573800
comm = "test\000\000\000)\000\000\000\000\000\000\000"
So the current holder of the mutex is the test process.
The mutex's wait_list records the tasks waiting for it; walk the list to see who is waiting on this mutex:
crash> list -s mutex_waiter.task -h 0xffff800061d67b40
ffff800061d67b40
task = 0xffff8000638c4600
ffff800061cefb40
task = 0xffff800062578000
ffff800066d8f120
task = 0xffff0000098f1318
crash> struct -x task_struct.comm 0xffff8000638c4600
comm = "lock_test2\000\000\000\000\000"
crash> struct -x task_struct.comm 0xffff800062578000
comm = "lock_test1\000\000\000\000\000"
crash> struct -x task_struct.comm 0xffff0000098f1318
comm = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
As the output shows, lock_test1 and lock_test2 are both waiting for this mutex, which test acquired in its ioctl call and never released. (The third list entry, 0xffff800066d8f120, is the wait_list head embedded in the mutex itself at offset 0x10, not a real waiter, which is why its comm is empty.)