Handling hardlockup Problems

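1. hardlockup symptoms

The following message appears at panic time: "Watchdog detected hard LOCKUP on cpu "

2. Cause of hardlockup

The kernel reports this error when it detects that a CPU has not handled any timer interrupt for a long time.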

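3. How hardlockup detection works

Method 1:

  1. An interrupt with higher priority than ordinary interrupts monitors whether ordinary interrupts are still being serviced. This is usually the NMI (non-maskable interrupt), and its callback is watchdog_overflow_callback.
  2. The ordinary interrupt handler watchdog_timer_fn updates hrtimer_interrupts every time it runs; hrtimer_interrupts counts how many times the timer interrupt has fired.
  3. When the NMI fires, watchdog_overflow_callback runs and compares the current hrtimer_interrupts with the value saved in hrtimer_interrupts_saved at the previous NMI. If the two values are equal, the timer interrupt has not run since the last NMI, so the system is in an abnormal state.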
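
The three steps above can be sketched in user space. This is only a model under stated assumptions, not kernel code: struct cpu_watchdog, timer_tick and nmi_check are hypothetical names, and the per-CPU variables are replaced by plain struct fields.

```c
#include <stdbool.h>

/* Hypothetical user-space model of one CPU's watchdog state. */
struct cpu_watchdog {
    unsigned long hrtimer_interrupts;       /* bumped by the timer interrupt */
    unsigned long hrtimer_interrupts_saved; /* snapshot taken by the last NMI */
};

/* Models the timer interrupt: proof that ordinary interrupts still run. */
void timer_tick(struct cpu_watchdog *wd)
{
    wd->hrtimer_interrupts++;
}

/* Models the NMI check: true means no timer tick since the previous NMI. */
bool nmi_check(struct cpu_watchdog *wd)
{
    if (wd->hrtimer_interrupts_saved == wd->hrtimer_interrupts)
        return true;

    /* The counter moved, so remember it for the next NMI. */
    wd->hrtimer_interrupts_saved = wd->hrtimer_interrupts;
    return false;
}
```

Calling timer_tick once and then nmi_check twice reports no lockup the first time and a lockup the second time, because no tick happened between the two checks.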

Method 2:

  1. Each CPU's watchdog_timer_fn calls watchdog_check_hardlockup_other_cpu to check whether the next CPU's interrupt count has been updated.
  2. This feature is meant for platforms that do not support NMI.
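
A minimal user-space sketch of the neighbor check in method 2. Everything here is an assumption made for illustration: NR_CPUS, the two arrays, next_cpu and timer_tick_and_check are hypothetical names, and the check runs on every tick, whereas a real implementation would need some slack because timers on different CPUs are not synchronized.

```c
#include <stdbool.h>

#define NR_CPUS 4 /* hypothetical CPU count for this model */

unsigned long irq_count[NR_CPUS];       /* per-CPU timer interrupt counters */
unsigned long irq_count_saved[NR_CPUS]; /* last value seen by the checking CPU */

/* Pick the neighbor to watch, wrapping around after the last CPU. */
unsigned int next_cpu(unsigned int cpu)
{
    return (cpu + 1) % NR_CPUS;
}

/*
 * Models the timer handler on `cpu`: count our own tick, then check the
 * neighbor. Returns true if the neighbor's counter has not moved since
 * the previous check, which suggests that the neighbor is locked up.
 */
bool timer_tick_and_check(unsigned int cpu)
{
    unsigned int other = next_cpu(cpu);

    irq_count[cpu]++;
    if (irq_count_saved[other] == irq_count[other])
        return true;

    irq_count_saved[other] = irq_count[other];
    return false;
}
```

Because the check runs in an ordinary timer interrupt on a different CPU, it still works when the stuck CPU has interrupts disabled, at the cost of needing at least one other healthy CPU.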

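The rest of this section walks through the implementation of method 1:

3.1 The watchdog_nmi_enable function

It first sets the NMI interval from watchdog_thresh (10s by default), then registers watchdog_overflow_callback as the callback.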

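int __read_mostly watchdog_thresh = 10; /* threshold in seconds */
static int watchdog_nmi_enable(unsigned int cpu)
{
    struct perf_event_attr *wd_attr = &wd_hw_attr;
    struct perf_event *event;

    /* Convert the threshold into a perf sample period. */
    wd_attr->sample_period = hw_nmi_get_sample_period(watchdog_thresh);

    /* Each counter overflow raises an NMI that runs the callback. */
    event = perf_event_create_kernel_counter(wd_attr, cpu, NULL, watchdog_overflow_callback, NULL);
    /* ... error handling and event enabling omitted ... */
    return 0;
}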
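
To make the 10s interval concrete, here is a sketch of how a threshold in seconds can become a sample period. It assumes the perf event counts CPU cycles and mirrors the formula I believe the x86 version of hw_nmi_get_sample_period uses (cpu_khz * 1000 * watchdog_thresh); other architectures may compute it differently. nmi_sample_period is a hypothetical name.

```c
#include <stdint.h>

/*
 * Number of CPU cycles between two watchdog NMIs.
 * cpu_khz is the CPU frequency in kHz, watchdog_thresh is in seconds.
 * Assumption: the perf counter counts CPU cycles.
 */
uint64_t nmi_sample_period(uint64_t cpu_khz, int watchdog_thresh)
{
    return cpu_khz * 1000 * (uint64_t)watchdog_thresh;
}
```

For a 2 GHz CPU (cpu_khz = 2000000) and watchdog_thresh = 10, this gives 20000000000 cycles, so the counter overflows roughly every 10 seconds while the CPU runs at full speed.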

3.2 The watchdog_timer_fn function

watchdog_interrupt_count increments hrtimer_interrupts by 1.

static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
{
        unsigned long touch_ts = __this_cpu_read(watchdog_touch_ts);
        struct pt_regs *regs = get_irq_regs();
        int duration;

        /* kick the hardlockup detector */
        watchdog_interrupt_count();

        /* ... soft lockup check and timer re-arming omitted ... */
        return HRTIMER_RESTART;
}
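
The comparison in the NMI only works if this timer fires more often than the NMI. A sketch of that arithmetic, assuming the mainline relations that the soft lockup threshold is twice watchdog_thresh and that the hrtimer period is one fifth of the soft lockup threshold; neither relation appears in the excerpt above, and hrtimer_sample_period_ns is a hypothetical name.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Period of the watchdog hrtimer in nanoseconds.
 * Assumption: soft lockup threshold = 2 * watchdog_thresh, and the timer
 * fires five times per soft lockup threshold.
 */
uint64_t hrtimer_sample_period_ns(int watchdog_thresh)
{
    uint64_t softlockup_thresh = (uint64_t)watchdog_thresh * 2;

    return softlockup_thresh * (NSEC_PER_SEC / 5);
}
```

With watchdog_thresh = 10 the timer fires every 4 seconds, so a healthy CPU increments hrtimer_interrupts two or three times within each 10 second NMI window.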

3.3 The watchdog_overflow_callback function

watchdog_overflow_callback uses is_hardlockup to decide whether the CPU has timed out:

static void watchdog_overflow_callback(struct perf_event *event,
                 struct perf_sample_data *data,
                 struct pt_regs *regs)
{
    if (is_hardlockup()) {
        int this_cpu = smp_processor_id();

        /* only print hardlockups once */
        if (__this_cpu_read(hard_watchdog_warn) == true)
                return;

        pr_emerg("Watchdog detected hard LOCKUP on cpu %d\n",
                 this_cpu);
        print_modules();
        print_irqtrace_events(current);
        if (regs)
                show_regs(regs);
        else
                dump_stack();
        if (hardlockup_panic)
                nmi_panic(regs, "Hard LOCKUP");

        __this_cpu_write(hard_watchdog_warn, true);
        return;
    }
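}

static bool is_hardlockup(void)
{
        unsigned long hrint = __this_cpu_read(hrtimer_interrupts);

        /* Counter unchanged since the last NMI: the timer never ran. */
        if (__this_cpu_read(hrtimer_interrupts_saved) == hrint)
                return true;

        /* Remember the current count for the next NMI. */
        __this_cpu_write(hrtimer_interrupts_saved, hrint);
        return false;
}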

4. How to handle hardlockup problems

4.1 Possible causes

  1. Interrupts are disabled for too long
  2. A CPU has hung