Handling soft lockup problems

1. Soft lockup symptoms

At the panic site the kernel log shows: "BUG: soft lockup - CPU#0 stuck for xxs! [comm:pid]"

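2. Cause of a soft lockup

The kernel starts a real-time thread, watchdog/x, with priority 99 on each CPU to refresh a timestamp periodically. If that thread is not scheduled for too long, a soft lockup is triggered.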

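3. How soft lockup detection works

  1. The kernel starts a watchdog thread on every CPU; the thread is woken periodically and records a per-CPU timestamp.
  2. A per-CPU hrtimer is started as well. When the hrtimer interrupt fires, its handler reads the current timestamp
  3. and compares it with the time recorded by the watchdog thread. If the two differ by more than the allowed range (derived from the configurable watchdog_thresh), a soft lockup is reported.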
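
The three steps above can be sketched as a small userspace simulation. This is only an illustration, not kernel code: the struct and the sim_* names are hypothetical, time is counted in whole seconds, and the check-then-touch ordering per tick is a simplification of what the real handler does.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU state for the simulation. */
struct cpu_watchdog {
    unsigned long touch_ts;   /* last time the watchdog thread ran, in seconds */
};

/* What the watchdog thread does when it gets to run: refresh the timestamp. */
static void sim_touch(struct cpu_watchdog *wd, unsigned long now)
{
    wd->touch_ts = now;
}

/* What the timer handler does: return the stall length in seconds, or 0. */
static unsigned long sim_check(const struct cpu_watchdog *wd,
                               unsigned long now, unsigned long thresh)
{
    unsigned long stalled = now - wd->touch_ts;

    return stalled > thresh ? stalled : 0;
}

/*
 * Simulate n timer ticks that are `period` seconds apart. thread_ran[i]
 * says whether the watchdog thread was scheduled after tick i.
 * Returns the time of the first reported lockup, or -1 if none occurs.
 */
static long sim_run(const bool *thread_ran, size_t n,
                    unsigned long period, unsigned long thresh)
{
    struct cpu_watchdog wd = { .touch_ts = 0 };

    for (size_t i = 0; i < n; i++) {
        unsigned long now = (i + 1) * period;

        if (sim_check(&wd, now, thresh))
            return (long)now;
        if (thread_ran[i])
            sim_touch(&wd, now);
    }
    return -1;
}
```

With a 4 second tick and a 20 second threshold, a thread that last ran at t = 8 is first reported at t = 32, the first tick at which the stall exceeds 20 seconds.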

3.1 The watchdog function

It does two things:

  1. Saves hrtimer_interrupts into soft_lockup_hrtimer_cnt;
  2. Updates the timestamp.
static void watchdog(unsigned int cpu) 
{
        __this_cpu_write(soft_lockup_hrtimer_cnt,
                         __this_cpu_read(hrtimer_interrupts));
        __touch_watchdog();
}
static void __touch_watchdog(void)
{
        __this_cpu_write(watchdog_touch_ts, get_timestamp());
}

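3.2 The watchdog_enable function

It does three things:

  1. Sets up the hrtimer;
  2. Enables the NMI (perf) event, which the hardlockup detector needs;
  3. Sets the watchdog kernel thread's scheduling policy to SCHED_FIFO with priority MAX_RT_PRIO - 1.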
static void watchdog_enable(unsigned int cpu)
{
        struct hrtimer *hrtimer = raw_cpu_ptr(&watchdog_hrtimer);

        /* kick off the timer for the hardlockup detector */
        hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
        hrtimer->function = watchdog_timer_fn;

        /* Enable the perf event */
        watchdog_nmi_enable(cpu);

        /* done here because hrtimer_start can only pin to smp_processor_id() */
        hrtimer_start(hrtimer, ns_to_ktime(sample_period),
                      HRTIMER_MODE_REL_PINNED);

        /* initialize timestamp */
        watchdog_set_prio(SCHED_FIFO, MAX_RT_PRIO - 1);
        __touch_watchdog();
}
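
The excerpt passes sample_period to hrtimer_start() without showing where the value comes from. As a rough worked example, assuming the relation used by mainline kernels of this era (soft lockup threshold = 2 * watchdog_thresh, timer period = one fifth of that threshold), the arithmetic looks like this. The function names are illustrative, not the kernel's.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Assumed relation: the soft lockup threshold is twice watchdog_thresh (seconds). */
static unsigned int softlockup_thresh(unsigned int watchdog_thresh)
{
    return watchdog_thresh * 2;
}

/*
 * Assumed relation: the hrtimer period is one fifth of the soft lockup
 * threshold, so the timer fires several times before a lockup is declared.
 * The result is in nanoseconds.
 */
static uint64_t sample_period_ns(unsigned int watchdog_thresh)
{
    return softlockup_thresh(watchdog_thresh) * (NSEC_PER_SEC / 5);
}
```

Under these assumptions, watchdog_thresh = 10 gives a 20 second soft lockup threshold and a timer period of 4,000,000,000 ns (4 seconds).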

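3.3 The watchdog_timer_fn function

It does the following:

  1. watchdog_interrupt_count() increments hrtimer_interrupts, a value the hardlockup detector uses;
  2. wake_up_process(softlockup_watchdog) wakes the watchdog kernel thread;
  3. is_softlockup() checks whether the timeout has been exceeded;
  4. If it has, the stack of the stuck task is printed.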
static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
{
        unsigned long touch_ts = __this_cpu_read(watchdog_touch_ts);
        struct pt_regs *regs = get_irq_regs();
        int duration;
        /* kick the hardlockup detector */
        watchdog_interrupt_count();

        /* kick the softlockup detector */
        wake_up_process(__this_cpu_read(softlockup_watchdog));
	
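        /* a nonzero duration means the timestamp is older than the threshold */
        duration = is_softlockup(touch_ts);
        if (unlikely(duration)) {
                pr_emerg("BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
                        smp_processor_id(), duration,
                        current->comm, task_pid_nr(current));
                panic("softlockup: hung tasks");
        }

        /* abridged excerpt: the full kernel function also calls
         * hrtimer_forward_now() to re-arm the timer before returning */
        return HRTIMER_RESTART;
}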

static int is_softlockup(unsigned long touch_ts)
{
	unsigned long now = get_timestamp();

	if (time_after(now, touch_ts + get_softlockup_thresh())) 
		return now - touch_ts;
	return 0;
}
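
The comparison in is_softlockup() goes through time_after() rather than a plain `>` so that it stays correct when the timestamp counter wraps around. The following userspace sketch shows the idea. The sketch_ names are hypothetical, the macro is a simplified version of the kernel's time_after() (without its type checks), and the threshold is passed in as a parameter instead of coming from get_softlockup_thresh().

```c
/*
 * Wraparound-safe "a is after b": the unsigned difference is reinterpreted
 * as signed, so a small step across the wrap point still compares correctly.
 */
#define sketch_time_after(a, b) ((long)((b) - (a)) < 0)

/* Return how long the CPU has been stuck, or 0 if within the threshold. */
static unsigned long sketch_is_softlockup(unsigned long now,
                                          unsigned long touch_ts,
                                          unsigned long thresh)
{
    if (sketch_time_after(now, touch_ts + thresh))
        return now - touch_ts;
    return 0;
}
```

For example, with a threshold of 20, a touch timestamp 6 units below the maximum unsigned long value and a current time 30 units later (already wrapped past zero) still yields a stall of 30, whereas a naive `now > touch_ts + thresh` test is only correct while neither value has wrapped.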

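4. How to handle a soft lockup

4.1 Possible causes

  1. Preemption or interrupts disabled for too long
  2. Interrupt storm
  3. A blocked interrupt service routine (ISR)

4.2 How to analyze

  1. Examine the call stack at the point of failure for code that blocks after disabling preemption
  2. Check the interrupt counts for signs of an interrupt storm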
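
For the second step, one possible approach is to take two snapshots of the per-IRQ counters (for example, from /proc/interrupts) a known interval apart and look for a line whose rate is abnormally high. The helper below is a minimal sketch of that comparison only. The function name and the rate limit are hypothetical, and what counts as a storm depends on the device and workload.

```c
#include <stddef.h>

/*
 * Compare two snapshots of per-IRQ counters taken interval_s seconds apart.
 * Returns the index of the IRQ with the highest rate above max_per_sec,
 * or -1 if no IRQ exceeds that rate. Counters are assumed not to have
 * been reset between the two snapshots.
 */
static long find_storm_irq(const unsigned long *before,
                           const unsigned long *after,
                           size_t n, unsigned long interval_s,
                           unsigned long max_per_sec)
{
    long worst = -1;
    unsigned long worst_rate = max_per_sec;

    if (interval_s == 0)
        return -1;

    for (size_t i = 0; i < n; i++) {
        unsigned long rate = (after[i] - before[i]) / interval_s;

        if (rate > worst_rate) {
            worst_rate = rate;
            worst = (long)i;
        }
    }
    return worst;
}
```

If the counters for three IRQs go from {100, 5000, 42} to {150, 2005000, 42} in 2 seconds, the second IRQ is firing about 1,000,000 times per second and would be flagged with a limit of 100,000 per second.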