1. Soft lockup symptom
The following message appears at the point of the panic: "BUG: soft lockup - CPU#0 stuck for xxs! [comm:pid]"
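2. Soft lockup cause
The kernel starts an RT thread named watchdog/x, with priority 99, that periodically refreshes a timestamp. If this thread is not scheduled for too long, a soft lockup is triggered.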
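3. Soft lockup basic principle
- The kernel starts a watchdog thread on every CPU; the thread is woken periodically and records a per-CPU timestamp.
- A per-CPU hrtimer is started as well. When the hrtimer interrupt fires, its handler reads the current timestamp.
- The handler compares the current timestamp with the one recorded by the watchdog thread. If the difference exceeds a threshold (derived from the configurable watchdog_thresh), a soft lockup is reported.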
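The three steps above can be modelled in user space. The following is a minimal sketch, not kernel code: struct cpu_model and the model_* functions are hypothetical names introduced only for illustration, and time is passed in as plain seconds.

```c
#include <stdint.h>

/* Hypothetical per-CPU state for this model; not a kernel structure. */
struct cpu_model {
	uint64_t touch_ts;	/* last time the watchdog thread ran, in seconds */
};

/* What the watchdog thread does each time it gets scheduled. */
static void model_watchdog_ran(struct cpu_model *cpu, uint64_t now)
{
	cpu->touch_ts = now;
}

/*
 * What the timer interrupt does: returns 0 if the thread ran recently
 * enough, otherwise the number of seconds since it last ran.
 */
static uint64_t model_timer_check(const struct cpu_model *cpu,
				  uint64_t now, uint64_t thresh)
{
	if (now > cpu->touch_ts + thresh)
		return now - cpu->touch_ts;
	return 0;
}
```

As long as model_watchdog_ran() keeps being called, model_timer_check() returns 0. Once the thread is starved for longer than thresh, the check returns the stuck duration, which is the value printed in the "stuck for xxs" message.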
3.1 The watchdog function
It does two things:
- saves hrtimer_interrupts into soft_lockup_hrtimer_cnt;
- updates the timestamp.
static void watchdog(unsigned int cpu)
{
	__this_cpu_write(soft_lockup_hrtimer_cnt,
			 __this_cpu_read(hrtimer_interrupts));
	__touch_watchdog();
}

static void __touch_watchdog(void)
{
	__this_cpu_write(watchdog_touch_ts, get_timestamp());
}
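Why the thread snapshots hrtimer_interrupts is easier to see in a small model. The sketch below assumes the snapshot is compared against the live counter to decide whether the thread has new work, which is how kernels that manage the watchdog threads through smpboot appear to use it. The struct and the model_* names are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the two per-CPU counters named above. */
struct wd_counters {
	unsigned long hrtimer_interrupts;	/* bumped on every timer tick */
	unsigned long soft_lockup_hrtimer_cnt;	/* snapshot taken by the thread */
};

/* Timer side: one more hrtimer interrupt has been handled. */
static void model_timer_tick(struct wd_counters *c)
{
	c->hrtimer_interrupts++;
}

/* Assumed run condition: a tick has happened since the last snapshot. */
static bool model_should_run(const struct wd_counters *c)
{
	return c->hrtimer_interrupts != c->soft_lockup_hrtimer_cnt;
}

/* Thread side: take the snapshot, mirroring the first step of watchdog(). */
static void model_watchdog(struct wd_counters *c)
{
	c->soft_lockup_hrtimer_cnt = c->hrtimer_interrupts;
}
```

In this model the thread has work to do after each tick and none once it has taken its snapshot, so it runs once per timer period.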
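3.2 The watchdog_enable function
It does three things:
- sets up the hrtimer;
- enables the NMI (perf event), which the hardlockup detector needs;
- sets the watchdog kernel thread's scheduling policy to SCHED_FIFO with priority MAX_RT_PRIO - 1.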
static void watchdog_enable(unsigned int cpu)
{
	struct hrtimer *hrtimer = raw_cpu_ptr(&watchdog_hrtimer);

	/* kick off the timer for the hardlockup detector */
	hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
	hrtimer->function = watchdog_timer_fn;

	/* Enable the perf event */
	watchdog_nmi_enable(cpu);

	/* done here because hrtimer_start can only pin to smp_processor_id() */
	hrtimer_start(hrtimer, ns_to_ktime(sample_period),
		      HRTIMER_MODE_REL_PINNED);

	/* initialize timestamp */
	watchdog_set_prio(SCHED_FIFO, MAX_RT_PRIO - 1);
	__touch_watchdog();
}
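The sample_period passed to hrtimer_start() is not defined in the excerpt. The sketch below assumes the formulas used by mainline kernels of this era: the soft lockup threshold is twice watchdog_thresh, and the timer fires five times per threshold. The model_* names and MODEL_NSEC_PER_SEC are hypothetical user-space stand-ins.

```c
#include <stdint.h>

#define MODEL_NSEC_PER_SEC 1000000000ULL

/* Assumed: soft lockup threshold in seconds is 2 * watchdog_thresh. */
static unsigned int model_softlockup_thresh(unsigned int watchdog_thresh)
{
	return watchdog_thresh * 2;
}

/* Assumed: the hrtimer fires five times within one threshold window. */
static uint64_t model_sample_period_ns(unsigned int watchdog_thresh)
{
	return model_softlockup_thresh(watchdog_thresh) *
	       (MODEL_NSEC_PER_SEC / 5);
}
```

Under these assumptions, the default watchdog_thresh of 10 gives a 20 second soft lockup threshold and a timer period of 4 seconds, so the watchdog thread has several chances to run before a lockup is declared.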
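3.3 The watchdog_timer_fn function
It does the following:
- watchdog_interrupt_count() increments hrtimer_interrupts, a value the hardlockup detector uses;
- wake_up_process(softlockup_watchdog) wakes the watchdog kernel thread;
- is_softlockup() checks whether the threshold has been exceeded;
- if it has, stack information for the running task is printed.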
static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
{
	unsigned long touch_ts = __this_cpu_read(watchdog_touch_ts);
	struct pt_regs *regs = get_irq_regs();
	int duration;

	/* kick the hardlockup detector */
	watchdog_interrupt_count();

	/* kick the softlockup detector */
	wake_up_process(__this_cpu_read(softlockup_watchdog));

	duration = is_softlockup(touch_ts);
	if (unlikely(duration)) {	/* threshold exceeded: report the error */
		pr_emerg("BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
			 smp_processor_id(), duration,
			 current->comm, task_pid_nr(current));
		panic("softlockup: hung tasks");
	}

	return HRTIMER_RESTART;
}
static int is_softlockup(unsigned long touch_ts)
{
	unsigned long now = get_timestamp();

	if (time_after(now, touch_ts + get_softlockup_thresh()))
		return now - touch_ts;

	return 0;
}
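is_softlockup() uses time_after() rather than a plain ">" comparison. The sketch below shows the wraparound-safe idea behind that macro in user space. model_time_after is a hypothetical name, and the cast relies on the usual two's-complement conversion from unsigned long to long.

```c
#include <limits.h>

/*
 * Same idea as the kernel's time_after(a, b): true if a is later than b,
 * even when the unsigned counter has wrapped between the two samples.
 * Relies on two's-complement conversion of (b - a) to a signed long.
 */
static int model_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}
```

For example, with b = ULONG_MAX - 2 and a = 3 (the counter wrapped in between), a plain a > b is false, while model_time_after(a, b) is true because the signed difference is negative.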
4. How to handle soft lockups
4.1 Possible causes
- Preemption or interrupts disabled for too long
- Interrupt storm
- A blocked interrupt service routine (ISR)
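4.2 How to analyze
- Examine the call stack of the failure for code that blocks after disabling preemption
- Examine the interrupt counts for signs of an interrupt storm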
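One way to check interrupt counts is to sample /proc/interrupts twice and compare the totals per IRQ line. The helpers below are a minimal user-space sketch of that idea. irq_line_total and irq_rate are hypothetical names, and the parser assumes the usual layout of a label followed by ":" and then one counter per CPU.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Sum the per-CPU counters on one /proc/interrupts line, e.g.
 *   " 24:   1000   2000   PCI-MSI 512000-edge   eth0"
 * Returns 0 for lines without a "label:" prefix (such as the CPU header).
 */
static unsigned long long irq_line_total(const char *line)
{
	const char *p = strchr(line, ':');
	unsigned long long total = 0;

	if (!p)
		return 0;
	p++;
	for (;;) {
		char *end;
		unsigned long long n;

		while (isspace((unsigned char)*p))
			p++;
		if (!isdigit((unsigned char)*p))
			break;		/* reached the controller name */
		n = strtoull(p, &end, 10);
		if (*end != '\0' && !isspace((unsigned char)*end))
			break;		/* token like "2-edge", not a counter */
		total += n;
		p = end;
	}
	return total;
}

/* Interrupts per second between two samples taken interval_s apart. */
static unsigned long long irq_rate(unsigned long long before,
				   unsigned long long after,
				   unsigned int interval_s)
{
	if (interval_s == 0 || after < before)
		return 0;
	return (after - before) / interval_s;
}
```

A line whose rate is far above its normal level between two samples is a candidate for the interrupt storm case. What counts as "far above" depends on the device and workload, so no threshold is built in here.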