1. hardlockup symptoms
When the system panics, the log contains: "Watchdog detected hard LOCKUP on cpu "
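2. hardlockup cause
The kernel reports this error when it detects that a CPU has not handled timer interrupts for a long time, typically because interrupts have been disabled on that CPU for too long.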
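3. hardlockup basic principle
Method one:
- An interrupt with higher priority than ordinary interrupts monitors whether ordinary interrupts are still running. This is usually the NMI (non-maskable interrupt), and its callback is watchdog_overflow_callback.
- The ordinary timer interrupt handler watchdog_timer_fn updates hrtimer_interrupts every time it runs; hrtimer_interrupts records how many times the timer interrupt has fired.
- When an NMI fires, watchdog_overflow_callback runs and compares the current hrtimer_interrupts with the value saved in hrtimer_interrupts_saved at the previous check. If the two values are equal, the timer interrupt has not run in between, and the CPU is considered to be in an abnormal state.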
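Method two:
- Each CPU calls watchdog_check_hardlockup_other_cpu from its own watchdog_timer_fn to check whether the interrupt count of the next CPU has been updated.
- This feature is intended for platforms that do not support NMI.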
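The idea behind method two can be modeled in user space. The following is a minimal sketch, not the kernel's implementation: NR_CPUS, timer_tick, next_cpu and check_other_cpu are hypothetical names chosen for illustration, and per-CPU variables are modeled as plain arrays. CPU hotplug and timing skew, which a real detector has to handle, are ignored here.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4 /* hypothetical CPU count for this sketch */

/* Per-CPU timer interrupt counters, modeled as plain arrays. */
static unsigned long hrtimer_interrupts[NR_CPUS];
static unsigned long hrtimer_interrupts_saved[NR_CPUS];

/* Called from a CPU's timer handler: record that the timer interrupt ran. */
static void timer_tick(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	hrtimer_interrupts[cpu]++;
}

/* Pick the CPU that `cpu` is responsible for watching (simple ring order). */
static int next_cpu(int cpu)
{
	return (cpu + 1) % NR_CPUS;
}

/*
 * Called from the timer handler of `cpu`. Returns true if the watched
 * CPU's counter has not moved since the previous check, which suggests
 * that its timer interrupt is no longer being serviced.
 */
static bool check_other_cpu(int cpu)
{
	int buddy = next_cpu(cpu);
	unsigned long now = hrtimer_interrupts[buddy];

	if (hrtimer_interrupts_saved[buddy] == now)
		return true;

	hrtimer_interrupts_saved[buddy] = now;
	return false;
}
```

In this model, CPU 0 calling check_other_cpu(0) twice without an intervening timer_tick(1) reports CPU 1 as locked up on the second call. No NMI is needed, because the check runs on a different CPU from the one that is stuck.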
The rest of this section describes the implementation of method one:
3.1 The watchdog_nmi_enable function
It first derives the NMI sample period from watchdog_thresh (10 s by default), and then registers watchdog_overflow_callback as the callback.
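/* hard lockup threshold in seconds */
int __read_mostly watchdog_thresh = 10;

static int watchdog_nmi_enable(unsigned int cpu)
{
	/* ... */
	/* convert the threshold into a perf sample period */
	wd_attr->sample_period = hw_nmi_get_sample_period(watchdog_thresh);
	/* register watchdog_overflow_callback as the overflow (NMI) handler */
	event = perf_event_create_kernel_counter(wd_attr, cpu, NULL, watchdog_overflow_callback, NULL);
	/* ... */
}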
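To make the 10 s figure concrete: the perf event counts CPU cycles, so the threshold in seconds has to be converted into a cycle count. The sketch below is modeled on how x86 is commonly described as doing this (CPU frequency in kHz times 1000 times the threshold). The function name nmi_sample_period and the cpu_khz parameter are assumptions of this example; other architectures may compute the period differently.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a threshold in seconds into a number of CPU cycles,
 * assuming the perf event counts cycles at cpu_khz kHz.
 */
static uint64_t nmi_sample_period(uint64_t cpu_khz, int watchdog_thresh)
{
	assert(watchdog_thresh > 0);
	return cpu_khz * 1000 * (uint64_t)watchdog_thresh;
}
```

For example, with a 2 GHz CPU (cpu_khz = 2000000) and watchdog_thresh = 10, the counter overflows and raises an NMI every 20000000000 cycles, that is, roughly every 10 s while the CPU runs at full frequency.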
3.2 The watchdog_timer_fn function
watchdog_timer_fn calls watchdog_interrupt_count, which increments hrtimer_interrupts by 1.
static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
{
	unsigned long touch_ts = __this_cpu_read(watchdog_touch_ts);
	struct pt_regs *regs = get_irq_regs();
	int duration;

	/* kick the hardlockup detector */
	watchdog_interrupt_count();
	/* ... rest of the handler omitted ... */
}
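For the comparison in the NMI callback to be meaningful, the timer must fire more often than the NMI. As I understand mainline kernels, the hrtimer period is derived from the soft lockup threshold (twice watchdog_thresh) divided by 5; treat that formula as an assumption of this sketch and check it against your kernel version. The name timer_period_ns is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Period of the watchdog hrtimer in nanoseconds, assuming
 * period = (2 * watchdog_thresh) seconds / 5.
 */
static uint64_t timer_period_ns(int watchdog_thresh)
{
	assert(watchdog_thresh > 0);
	uint64_t softlockup_thresh = 2 * (uint64_t)watchdog_thresh;

	return softlockup_thresh * (NSEC_PER_SEC / 5);
}
```

Under this assumption, watchdog_thresh = 10 gives a 4 s timer period, so a healthy CPU increments hrtimer_interrupts at least twice between two consecutive NMI checks.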
3.3 The watchdog_overflow_callback function
watchdog_overflow_callback uses is_hardlockup to decide whether a hard lockup has occurred:
static void watchdog_overflow_callback(struct perf_event *event,
		struct perf_sample_data *data,
		struct pt_regs *regs)
{
	if (is_hardlockup()) {
		int this_cpu = smp_processor_id();

		/* only print hardlockups once */
		if (__this_cpu_read(hard_watchdog_warn) == true)
			return;

		pr_emerg("Watchdog detected hard LOCKUP on cpu %d\n",
			 this_cpu);
		print_modules();
		print_irqtrace_events(current);
		if (regs)
			show_regs(regs);
		else
			dump_stack();

		if (hardlockup_panic)
			nmi_panic(regs, "Hard LOCKUP");

		__this_cpu_write(hard_watchdog_warn, true);
		return;
	}
}
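static bool is_hardlockup(void)
{
	unsigned long hrint = __this_cpu_read(hrtimer_interrupts);

	/* counter unchanged since the last NMI: the timer interrupt did not run */
	if (__this_cpu_read(hrtimer_interrupts_saved) == hrint)
		return true;

	/* remember the current count for the next NMI check */
	__this_cpu_write(hrtimer_interrupts_saved, hrint);
	return false;
}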
4. How to handle hardlockup problems
4.1 Possible causes
- Interrupts are disabled for too long
- A CPU is hung