Watchdog原理以及问题初定位

1,456 阅读13分钟

Watchdog原理以及问题定位

Watchdog工作机制

一、概述

在Android中,设计了一套软件层面的Watchdog,用于保护系统的一些重要服务以及定期检测系统关键服务是否发生死锁的事件。当系统关键服务和线程长时间持锁,处于block状态的时候,通常会让Andorid系统重启,即通过Watchdog机制,将system_server进程杀死,而导致上层重启。

二、Watchdog工作流程图

image.png

三、Watchdog具体实现流程

Watchdog继承于Thread,是一个运行在system_server进程中的一个线程(即在system_server启动过程中完成初始化)。

public class Watchdog extends Thread {}

3.1 system_server下的 startBootStrapServices

【frameworks/base/services/java/com/android/server/SystemServer.java】

private void startBootstrapServices(@NonNull TimingsTraceAndSlog t) {

......

// Start the watchdog as early as possible so we can crash the system server

// if we deadlock during early boot

t.traceBegin("StartWatchdog");

final Watchdog watchdog = Watchdog.getInstance();

watchdog.start();

在system_server启动过程中,会启动诸多服务,其中调用startBootStrapServices 会启动系统的boot级服务,这里会启动 ActivityManagerService, PowerManagerService, LightsService, DisplayManagerService, PackageManagerService, UserManagerService, sensor等服务,而我们的watchdog也是在这一阶段启动,通过注释可以了解到 尽早的启动Watchodg,可以在早期系统发生死锁的时候,可以更早的使system_server崩溃,从而利于分析问题的原因。

3.2 getInstance

【frameworks/base/services/core/java/com/android/server/Watchdog.java】

private Watchdog() {};
public static Watchdog getInstance() {
 if (sWatchdog == null) { //单例模式
 sWatchdog = new Watchdog();
 }

 return sWatchdog;
 }

因为Wtachdog构造方法是私有的,所有外界获取WatchDog 就是通过getInstance(),getInstance方法可以看出Watchdog是一个单例对象,系统中只有这一处。

3.3 Watchdog的构造方法

【frameworks/base/services/core/java/com/android/server/Watchdog.java】

private Watchdog() {
 //设置线程名为Watchdog
 super("watchdog");
 // Initialize handler checkers for each common thread we want to check. 
 //请注意,我们目前没有检查后台线程,因为它可能会持有较长时间运行的操作,
 // 而不能保证那里的操作的及时性。

 //将前台线程加入监控队列
 mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
 "foreground thread", DEFAULT_TIMEOUT);
 mHandlerCheckers.add(mMonitorChecker);
 //将主线程加入监控队列
 mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
 "main thread", DEFAULT_TIMEOUT));
 //将ui线程加入监控队列
 mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
 "ui thread", DEFAULT_TIMEOUT));
 //将io线程加入
 mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
 "i/o thread", DEFAULT_TIMEOUT));
 //将dispaly线程加入
 mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
 "display thread", DEFAULT_TIMEOUT));
 // 将动画线程加入
 mHandlerCheckers.add(new HandlerChecker(AnimationThread.getHandler(),
 "animation thread", DEFAULT_TIMEOUT));
 //将SurfaceAnimationThread 加入
 mHandlerCheckers.add(new HandlerChecker(SurfaceAnimationThread.getHandler(),
 "surface animation thread", DEFAULT_TIMEOUT));

 // 初始化binder线程监控 主要用于检测binder线程是否达到连接上限16个
 //如果大于等于16个 则阻塞线程等待mThreadCountDecrement唤醒
 addMonitor(new BinderThreadMonitor());

 mInterestingJavaPids.add(Process.myPid());

 // See the notes on DEFAULT_TIMEOUT.
 assert DB ||
 DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;
 }

watchdog的构造方法,设置线程名、将关键线程加入到mMonitorChecker 队列当中,当Watchdog执行run方法之后,便会依次对mMonitorChecker中的HandlerChecker对象 进行轮询访问,检查是否有超时,如果存在超时情况,则触发Watchdog,发生上层重启。

3.4 HandlerChecker类

【frameworks/base/services/core/java/com/android/server/Watchdog.java】

public final class HandlerChecker implements Runnable {
 private final Handler mHandler; //需要检测的线程对应的handler
 private final String mName;//线程描述名
 private final long mWaitMax;//最长等待间隔时间
 private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
 private final ArrayList<Monitor> mMonitorQueue = new ArrayList<Monitor>();
 private boolean mCompleted;//是否检查完成的标志位 空闲的话默认为true
 private Monitor mCurrentMonitor;//目前正在检查的Monitor
 private long mStartTime; //检查开始时间
 private int mPauseCount; //停止次数 如果大于0 说明此时HandlerChecker处于暂停状态

 HandlerChecker(Handler handler, String name, long waitMaxMillis) {
 mHandler = handler;
 mName = name;
 mWaitMax = waitMaxMillis;
 mCompleted = true;
 }
Used for checking status of handle threads and scheduling monitor callbacks.

HandlerChecker对象 用于存放当前handler对象,当前线程名 以及最长等待时间,HandlerChecker 是Runnable 所以run方法是其核心的部分。

3.4.1 addMonitorLocked

void addMonitorLocked(Monitor monitor) {
 // We don't want to update mMonitors when the Handler is in the middle of checking
 // all monitors. We will update mMonitors on the next schedule if it is safe
 mMonitorQueue.add(monitor);//注册的Monitor都在这里
 }
public class Watchdog extends Thread {
 public void addMonitor(Monitor monitor) {
 synchronized (this) {
 mMonitorChecker.addMonitorLocked(monitor);
 }
 }
}

addMonitorLocked用于监控binder线程(见3.3watchdog构造方法),将将monitor添加到HandlerChecker的成员变量mMonitorQueue中,这里添加的是BinderThreadMonitor对象。

  • addMonitor(): 用于监控实现了Watchdog.Monitor接口的服务.这种超时可能是”android.fg”线程消息处理得慢,也可能是monitor迟迟拿不到锁

BinderThreadMonitor

private static final class BinderThreadMonitor implements Watchdog.Monitor {
 @Override
 public void monitor() {
 Binder.blockUntilThreadAvailable();
 }
 }

blockUntilThreadAvailable最终调用的是IPCThreadState,等待有空闲的binder线程。

3.4.2 scheduleCheckLocked

public void scheduleCheckLocked() {
 if (mCompleted) {
 // Safe to update monitors in queue, Handler is not in the middle of work
 mMonitors.addAll(mMonitorQueue);
 mMonitorQueue.clear();
 }
 //检测的mMonitors为零且looper正在从队列中轮询任务 
 // 或者正在暂停的情况下 不要进行一些数据的重置
 if ((mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling())
 || (mPauseCount > 0)) {
 // Don't schedule until after resume OR
 // If the target looper has recently been polling, then
 // there is no reason to enqueue our checker on it since that
 // is as good as it not being deadlocked. This avoid having
 // to do a context switch to check the thread. Note that we
 // only do this if we have no monitors since those would need to
 // be executed at this point.
 mCompleted = true;
 return;
 }
 if (!mCompleted) {
 // we already have a check in flight, so no need
 return;
 }
 //检查开始时将mCompleted 置为fasle
 mCompleted = false;
 mCurrentMonitor = null;
 mStartTime = SystemClock.uptimeMillis();
 // 往handler前面插入任务(插队)
 mHandler.postAtFrontOfQueue(this);
 }

3.4.3 isOverdueLocked

用于判断是否超时


boolean isOverdueLocked() {
 return (!mCompleted) && (SystemClock.uptimeMillis() > mStartTime + mWaitMax);
 }

3.4.4 getCompletionStateLocked

public int getCompletionStateLocked() {
 if (mCompleted) {
 return COMPLETED;
 } else {
 long latency = SystemClock.uptimeMillis() - mStartTime;
 if (latency < mWaitMax/2) {
 return WAITING;
 } else if (latency < mWaitMax) {
 return WAITED_HALF;
 }
 }
 return OVERDUE;
 }

通过mStartTime、mWaitMax和mCompleted 进行状态的判断

前 mWaitMax/2 秒为WAITING状态,后mWaitMax/2为WAITED_HALF状态,大于mWaitMax为超时状态也就是OVERDUE状态

  • COMPLETED = 0 等待完成
  • WAITING = 1 等待时间为30s
  • WAITED_HALF = 2 等待时间处于30s~60s之间
  • OVERDUE = 3 等待时间超过60s

3.4.5 describeBlockedStateLocked

String describeBlockedStateLocked() {
 if (mCurrentMonitor == null) {
 return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
 } else {
 return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
 + " on " + mName + " (" + getThread().getName() + ")";
 }
 }

describeBlockedStateLocked 获取Blocked状态的描述 在哪个的HandlerChecker中 或者 当前handler执行哪个monitor。

3.4.6 HandlerChecker.run

public void run() {
 // 一旦我们到达这里,我们确保即使我们调用#addMonitorLocked也不会改变mMonitors,
 // 因为我们首先将新的监视器添加到mMonitorQueue中,
 // 并在下一次mCompleted为真时将它们移到mMonitors中,此时我们已经完成了这个方法的执行。
 // 所以这里用了两个数组分别存储Monitor
 final int size = mMonitors.size();
 for (int i = 0 ; i < size ; i++) {
 synchronized (Watchdog.this) {
 mCurrentMonitor = mMonitors.get(i);
 }
 mCurrentMonitor.monitor();
 }

 synchronized (Watchdog.this) {
 mCompleted = true;
 mCurrentMonitor = null;
 }
 }

可以看到大部分方法后缀都加上了Locked,这是因为在Watchdog调用的时候都会加上synchronize锁,保证不会发生多线程并发导致的问题

3.5 Watchdog.run

public void run() {
 boolean waitedHalf = false;
 while (true) {
 final List<HandlerChecker> blockedCheckers;
 //超时原因用于日志的输出
 final String subject;
 //是否允许重启 默认true在watchdog.setAllowRestart中会进行重新赋值
 final boolean allowRestart;
 //调试进程连接数
 int debuggerWasConnected = 0;
 synchronized (this) {
 long timeout = CHECK_INTERVAL;//CHECK_INTERVAL = 30s
 //1.30s轮询系统中所有的montior
 for (int i=0; i<mHandlerCheckers.size(); i++) {
 HandlerChecker hc = mHandlerCheckers.get(i);
 hc.scheduleCheckLocked();//见3.4.2
 }

 if (debuggerWasConnected > 0) {
 debuggerWasConnected--;
 }

 //注意:这里我们使用uptimeMillis是因为我们不想增加睡眠时的等待时间
 //如果设备处于睡眠状态,那么我们等待超时的事物也将处于睡眠状态
 //并且没有机会运行,从而导致什么时间杀死事物的判断是错误的。
 //通俗的讲就是uptimeMillis只会在设备唤醒的时候计算超时,设备休眠的话计算时间会导致错误的重启
 //2.等待30s,等待检查结果
 long start = SystemClock.uptimeMillis();
 while (timeout > 0) {
 if (Debug.isDebuggerConnected()) {
 debuggerWasConnected = 2;
 }
 try {
 wait(timeout);//触发中断,直接捕获异常,继续等待.
 // Note: mHandlerCheckers and mMonitorChecker may have changed after waiting
 } catch (InterruptedException e) {
 Log.wtf(TAG, e);
 }
 if (Debug.isDebuggerConnected()) {
 debuggerWasConnected = 2;
 }
 //有可能在wait到一半的时候发生InterruptedException 导致时间没有走完
 //只要没有消耗完timeout的值,我们就继续等待
 timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
 }

 final int waitState = evaluateCheckerCompletionLocked();
 //都完成了的话,我们就继续下一个循环,本次循环结束
 if (waitState == COMPLETED) {
 // The monitors have returned; reset
 waitedHalf = false;
 continue;
 } else if (waitState == WAITING) {
 // still waiting but within their configured intervals; back off and recheck
 // 等待的前半段时间,继续下一个循环,本次结束
 continue;
 } else if (waitState == WAITED_HALF) {
 if (!waitedHalf) {
 Slog.i(TAG, "WAITED_HALF");
 // We've waited half the deadlock-detection interval. Pull a stack
 // trace and wait another half.
 // 我们已经等待了一半的死锁检查时间,获取堆栈跟踪并等待下一段时间
 ArrayList<Integer> pids = new ArrayList<>(mInterestingJavaPids);
 ActivityManagerService.dumpStackTraces(pids, null, null,
 getInterestingNativePids(), null);
 waitedHalf = true;
 }
 continue;
 }

 // 走到这里意味着已经有handler超时了 即状态是overdue =60s
 // 获取超时的HandlerChecker
 blockedCheckers = getBlockedCheckersLocked();
 //获取超时HandlerChecker的信息
 subject = describeCheckersLocked(blockedCheckers);
 //将重启状态设置为true
 allowRestart = mAllowRestart;
 }

 //如果程序走到这里,意味着系统很可能已经挂起了
 //首先从系统进程的所有线程中手机堆栈跟踪,然后杀死该进程
 //这样系统才会重启
 EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

 ArrayList<Integer> pids = new ArrayList<>(mInterestingJavaPids);

 long anrTime = SystemClock.uptimeMillis();
 StringBuilder report = new StringBuilder();
 report.append(MemoryPressureUtil.currentPsiState());
 ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(false);
 StringWriter tracesFileException = new StringWriter();
 //打印java线程和native线程的堆栈
 final File stack = ActivityManagerService.dumpStackTraces(
 pids, processCpuTracker, new SparseArray<>(), getInterestingNativePids(),
 tracesFileException);

 // Give some extra time to make sure the stack traces get written.
 // The system's been hanging for a minute, another second or two won't hurt much.
 // 睡眠5s,确保堆栈能够完全写入文件
 SystemClock.sleep(5000);

 processCpuTracker.update();
 report.append(processCpuTracker.printCurrentState(anrTime));
 report.append(tracesFileException.getBuffer());

 // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
 // 让kernel dump全部的block线程和cpu信息
 doSysRq('w');
 doSysRq('l');

 // Try to add the error to the dropbox, but assuming that the ActivityManager
 // itself may be deadlocked. (which has happened, causing this statement to
 // deadlock and the watchdog as a whole to be ineffective)
 //将错误加入到dropbox里
 Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
 public void run() {
 // If a watched thread hangs before init() is called, we don't have a
 // valid mActivity. So we can't log the error to dropbox.
 if (mActivity != null) {
 mActivity.addErrorToDropBox(
 "watchdog", null, "system_server", null, null, null,
 subject, report.toString(), stack, null);
 }
 FrameworkStatsLog.write(FrameworkStatsLog.SYSTEM_SERVER_WATCHDOG_OCCURRED,
 subject);
 }
 };
 dropboxThread.start();
 try {
 //等待2s让dropbox得以返回
 dropboxThread.join(2000); // wait up to 2 seconds for it to return.
 } catch (InterruptedException ignored) {}

 IActivityController controller;
 synchronized (this) {
 controller = mController;
 }
 if (controller != null) {
 Slog.i(TAG, "Reporting stuck state to activity controller");
 try {
 //由于挂起system process而禁用dump防止controller在报告错误的时候被挂起
 Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
 // 1 = keep waiting, -1 = kill system
 int res = controller.systemNotResponding(subject);
 if (res >= 0) {
 Slog.i(TAG, "Activity controller requested to coninue to wait");
 waitedHalf = false;
 continue;
 }
 } catch (RemoteException e) {
 }
 }

 // Only kill the process if the debugger is not attached.
 //只有在没有debugger链接的情况下才能杀死进程
 if (Debug.isDebuggerConnected()) {
 debuggerWasConnected = 2;
 }
 if (debuggerWasConnected >= 2) {
 Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
 } else if (debuggerWasConnected > 0) {
 Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
 } else if (!allowRestart) {
 Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
 } else {
 Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
 WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
 Slog.w(TAG, "*** GOODBYE!");
 if (!Build.IS_USER && isCrashLoopFound()
 && !WatchdogProperties.should_ignore_fatal_count().orElse(false)) {
 breakCrashLoop();
 }
 Process.killProcess(Process.myPid());
 System.exit(10);
 }

 waitedHalf = false;
 }
 }

Watchdog的run方法大体上可以分为5个步骤

  • Watchdog 运行之后,便陷入循环中,依次调用每个HandlerChecker的scheduleChekerLocked方法
  • 调度完HandlerChecker之后,便开始定期检查是否超时,每次间隔时间由CHECK_INTERVAL设置,即30s
  • 每次检查都会调用evaluateCheckerCompletionLocked来评估来评估Handlercheck的状态
  • 如果超时时间到了,还有HandlerChecker处于OverDue状态,则通过getBlockedCheckersLocked方法,获取阻塞的HandlerChecker,生成对应的描述信息
  • 尝试将一些错误加入到dropbox中,之后保存日志,包括一些运行时的堆栈信息,重启系统(调用Process.killProcess,向system_server发送9号信号)

3.6 Watchdog中一些重要的函数

3.6.1 getBlockedCheckersLocked

private ArrayList<HandlerChecker> getBlockedCheckersLocked() {
 ArrayList<HandlerChecker> checkers = new ArrayList<HandlerChecker>();
 for (int i=0; i<mHandlerCheckers.size(); i++) { 
 HandlerChecker hc = mHandlerCheckers.get(i);
 if (hc.isOverdueLocked()) {
 checkers.add(hc);
 }
 }
 return checkers;
 }
  • 遍历所以Checker 将没有完成的checker且超时的Checker加入队列

3.6.2 describeCheckersLocked

private String describeCheckersLocked(List<HandlerChecker> checkers) {
 StringBuilder builder = new StringBuilder(128);
 for (int i=0; i<checkers.size(); i++) {
 if (builder.length() > 0) {
 builder.append(", ");
 }
 builder.append(checkers.get(i).describeBlockedStateLocked());
 }
 return builder.toString();
 }

将所有执行时间超过1分钟的handler线程或者monitor记录下来,本质上还是调用了describeBlockedStateLocked(见3.4.5)

3.6.3 ActivityManagerService.dumpStackTraces

打印java线程和native线程的堆栈

static File dumpStackTraces(ArrayList<Integer> firstPids,
 ProcessCpuTracker processCpuTracker, SparseArray<Boolean> lastPids,
 ArrayList<Integer> nativePids, StringWriter logExceptionCreatingFile,
 long[] firstPidOffsets) {

 ..... 
 .....
 final File tracesDir = new File(ANR_TRACE_DIR); // 路径/data/anr
 // Each set of ANR traces is written to a separate file and dumpstate will process
 // all such files and add them to a captured bug report if they're recent enough.
 //每一组anr堆栈将会写在独立的文件当中
 maybePruneOldTraces(tracesDir);

 // NOTE: We should consider creating the file in native code atomically once we've
 // gotten rid of the old scheme of dumping and lot of the code that deals with paths
 // can be removed.
 File tracesFile;
 try {
 tracesFile = createAnrDumpFile(tracesDir);
 } catch (IOException e) {
 Slog.w(TAG, "Exception creating ANR dump file:", e);
 if (logExceptionCreatingFile != null) {
 logExceptionCreatingFile.append("----- Exception creating ANR dump file -----\n");
 e.printStackTrace(new PrintWriter(logExceptionCreatingFile));
 }
 return null;
 }

 Pair<Long, Long> offsets = dumpStackTraces(
 tracesFile.getAbsolutePath(), firstPids, nativePids, extraPids);
 if (firstPidOffsets != null) {
 if (offsets == null) {
 firstPidOffsets[0] = firstPidOffsets[1] = -1;
 } else {
 firstPidOffsets[0] = offsets.first; // Start offset to the ANR trace file
 firstPidOffsets[1] = offsets.second; // End offset to the ANR trace file
 }
 }
 return tracesFile;}

3.6.4 rebootSystem

void rebootSystem(String reason) {
 Slog.i(TAG, "Rebooting system because: " + reason);
 IPowerManager pms = (IPowerManager)ServiceManager.getService(Context.POWER_SERVICE);
 try {
 pms.reboot(false, reason, false);
 } catch (RemoteException ex) {
 }
 }

收到reboot广播后将会调用PMS重启手机

3.6.5 doSysRq

private void doSysRq(char c) {
 try {
 FileWriter sysrq_trigger = new FileWriter("/proc/sysrq-trigger");
 sysrq_trigger.write(c);
 sysrq_trigger.close();
 } catch (IOException e) {
 Slog.w(TAG, "Failed to write to /proc/sysrq-trigger", e);
 }
 }

通过向节点/proc/sysrq-trigger写入字符,触发kernel来dump所有阻塞线程,输出所有CPU的backtrace到kernel log。

四、增加机制

细化watchdog事件的上报时机,准确获取watchdog各个阶段的打点数据

4.1 WorkHandler类

private final class WorkerHandler extends Handler {
 static final int WAITED_HALF_DUMP = 0; 
 static final int WATCH_DOG_DUMP = WAITED_HALF_DUMP + 1;//
 static final int WATCH_DOG_WITH_WAITED_HALF_DUMP = WAITED_HALF_DUMP + 2;

 public WorkerHandler(Looper looper) {
 super(looper, null);
 }
}
  • WAITED_HALF_DUMP watchdog线程等待超过30s,当前状态为waited half
  • WATCH_DOG_DUMP watchdog线程等待超过60s,watchdog已经发生,并且所有日志打印和文件写入都已完成
  • WATCH_DOG_WITH_WAITED_HALF_DUMP watchdog线程等待超过60s,watchdog已经发生,同时又等待异步子线程打印日志和写入文件超过60s,总共等待120s

4.1.1 handleMessage

public void handleMessage(Message msg) {
 ArrayList<Integer> pids = new ArrayList<>(mInterestingJavaPids); 
 boolean waitedHalf = false;
 switch (msg.what) {
 case WAITED_HALF_DUMP:
 // 已经等待了30s,打印堆栈,然后等待另一半时间
 mInitialStack = ActivityManagerService.dumpStackTraces(pids, null, null,
 getInterestingNativePids(), null);
 break;
 case WATCH_DOG_WITH_WAITED_HALF_DUMP:
 waitedHalf = true;
 case WATCH_DOG_DUMP:
 final String subject = (String)msg.obj; 
 StringBuilder report = new StringBuilder();
 final File newFd = dumpTracesFile(waitedHalf,pids, report); //生成WDT文件

 mActivity.addErrorToDropBox(
 "watchdog", null, "system_server", null, null,null,
 subject, report.toString(), newFd, null);

 synchronized (mDumpCompleteLock) {
 if (!mDumpCompleted) {
 // Trigger the kernel to dump all blocked threads, and backtraces
 // on all CPUs to the kernel log
 Slog.e(TAG, "Triggering SysRq for system_server watchdog");
 doSysRq('w');
 doSysRq('l');

 // report the watchdog and dump is completed event to MQS.
 //向MQS通知这个watchdog并且dump完成的事件
 WatchdogInjector.onWatchdog(MQSEvent.EVENT_JWDT, Process.myPid(),
 subject, newFd, getBlockedCheckersLocked());
 if (WatchdogInjector.isDebugpolicyed() && (WatchdogInjector.CheckDState(Process.myPid())
 || WatchdogInjector.checkHalsDstate())) {
 Slog.e(TAG, "Triggering SysRq Panic Dump when has D state thread")
 //等待3s,确保block的线程信息已经dump 进kernel log
 SystemClock.sleep(3000);
 // now try to crash when has D state thread
 doSysRq('c');
 }
 mDumpCompleted = true;
 }
 }
 break;
 default:
 Slog.w(TAG, " // " + "wrong message received of WorkerHandler");
 break;
 }
 }

handleMessage根据sendMessage发送回来的msg信息,执行相应操作

  • WAITED_HALF_DUMP 此时watchdog已经发生了30s,打印监控的线程的java与native堆栈
  • WATCH_DOG_WITH_WAITED_HALF_DUMP 将waitHalf 置为true;
  • WATCH_DOG_DUMP watchdog线程等待超过60s,调用dumpTracesFile生成WDT文件,向MQS打点系统通知watchdog事件和dump完成事件。

4.1.2 ensureWorkerHandlerReady

private void ensureWorkerHandlerReady() {
 if (mWorkerHandler == null) {
 mWorkerThread = new HandlerThread("watchdogWorkerThread");
 mWorkerThread.start();
 mWorkerHandler = new WorkerHandler(mWorkerThread.getLooper());
 }
 }

创建WorkHandler实例,命名为watchdogWorkerThread,并looper

4.1.3isDumpCompleted

private boolean isDumpCompleted() {
 synchronized (mDumpCompleteLock) {
 return mDumpCompleted;
 }
 }

判断是dump完成的状态

4.1.4 ensureDumpCompletedOrKernelReboot

private void ensureDumpCompletedOrKernelReboot() {
 long timeout = DEFAULT_TIMEOUT; //设置默认时间 根据DB判断 10s或者60s
 long end = SystemClock.uptimeMillis() + timeout; 
 while (!isDumpCompleted() && timeout > 0) { //如果dump没有完成且超时时间大于0则一直等待
 try {
 wait(1000); //等待1s
 } catch (InterruptedException e) {
 Log.wtf(TAG, e);
 }
 timeout = end - SystemClock.uptimeMillis();
 }
 if (timeout <= 0) {
 synchronized (mDumpCompleteLock) {
 if (!mDumpCompleted) {
 // Trigger the kernel to dump all blocked threads, and backtraces
 // on all CPUs to the kernel log
 Slog.e(TAG, "Triggering SysRq for system_server watchdog");
 doSysRq('w');
 doSysRq('l');

 // report the watchdog dump error event to MQS.
 WatchdogInjector.onWatchdog(MQSEvent.EVENT_JWDT_DUMP_ERROR, Process.myPid(),
 describeCheckersLocked(getBlockedCheckersLocked()), null);
 mDumpCompleted = true;
 } else {
 return;
 }
 }
 // now try to crash the target
 doSysRq('c');
 }
 }

watchdog超时60s,需要同时等待异步子线程打印日志和写入文件,总计120s

4.1.5 dumpTracesFile

private File dumpTracesFile(boolean waitedHalf, ArrayList<Integer> pids, StringBuilder report) {
 long anrTime = SystemClock.uptimeMillis(); //更新时间

 report.append(MemoryPressureUtil.currentPsiState()); 
 ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(false);
 StringWriter tracesFileException = new StringWriter();

 //保存java以及native进程的堆栈
 final File finalStack = ActivityManagerService.dumpStackTraces(
 pids, processCpuTracker, new SparseArray<>(), getInterestingNativePids(),
 tracesFileException);

 //Collect Binder State logs to get status of all the transactions
 //收集binder 状态log来获得整个通信过程状态
 if (Build.IS_DEBUGGABLE) {
 binderStateRead();
 }

 //等待5s,确保堆栈 文件输出完成
 SystemClock.sleep(5000);

 processCpuTracker.update();
 report.append(processCpuTracker.printCurrentState(anrTime));
 report.append(tracesFileException.getBuffer());

 File watchdogTraces; //创建WDT文件
 //文件名命名方式为 traces_SystemServer_WDT_问题发时间_对应进程pid
 String newTracesPath = "traces_SystemServer_WDT" 
 + mTraceDateFormat.format(new Date()) + "_pid"
 + String.valueOf(Process.myPid());
 // 创建文件路径/data/anr
 File tracesDir = new File(ActivityManagerService.ANR_TRACE_DIR);
 //WDT文件路径 /data/anr/traces_SystemServer_WDT_time_pid_
 watchdogTraces = new File(tracesDir, newTracesPath);
 try {
 if (watchdogTraces.createNewFile()) {
 FileUtils.setPermissions(watchdogTraces.getAbsolutePath(),
 0600, -1, -1); // -rw------- permissions

 //将前30s与后30sdump出的堆栈合并在一个新文件中
 // dumpStackTraces() 可能返回空对象, 因此需要检查一下
 if (mInitialStack != null) {
 // check the last-modified time of this file.
 // we are interested in this only it was written to in the
 // last 5 minutes or so
 //检查最后写入的这个文件
 final long age = System.currentTimeMillis()
 - mInitialStack.lastModified();
 final long FIVE_MINUTES_IN_MILLIS = 1000 * 60 * 5;
 if (age < FIVE_MINUTES_IN_MILLIS) {
 Slog.e(TAG, "First set of traces taken from "
 + mInitialStack.getAbsolutePath());
 appendFile(watchdogTraces, mInitialStack);
 } else {
 Slog.e(TAG, "First set of traces were collected more than "
 + "5 minutes ago, ignoring ...");
 }
 } else {
 Slog.e(TAG, "First set of traces are empty!");
 }

 if (finalStack != null) {
 Slog.e(TAG, "Second set of traces taken from "
 + finalStack.getAbsolutePath());
 appendFile(watchdogTraces, finalStack);
 } else {
 Slog.e(TAG, "Second set of traces are empty!");
 }
 } else {
 Slog.w(TAG, "Unable to create Watchdog dump file: createNewFile failed");
 }
 } catch (Exception e) {
 // catch any exception that happens here;
 // why kill the system when it is going to die anyways?
 Slog.e(TAG, "Exception creating Watchdog dump file:", e);
 }

 mInitialStack = null;
 return watchdogTraces;
 }

dumptracesFile主要用于将前后两次trace合并到一个新文件中,方便调试分析

4.2 一些增加的监控进程(含原生)

// Which native processes to dump into dropbox's stack traces
 public static final String[] NATIVE_STACKS_OF_INTEREST = new String[] {
 "/system/bin/audioserver",
 "/system/bin/cameraserver",
 "/system/bin/drmserver",
 "/system/bin/mediadrmserver",
 "/system/bin/mediaserver",
 "/system/bin/netd",
 "/system/bin/sdcard",
 "/system/bin/surfaceflinger",
 "/system/bin/vold",
 "media.extractor", // system/bin/mediaextractor
 "media.metrics", // system/bin/mediametrics
 "media.codec", // vendor/bin/hw/android.hardware.media.omx@1.0-service
 "media.swcodec", // /apex/com.android.media.swcodec/bin/mediaswcodec
 "com.android.bluetooth", // Bluetooth service
 "/apex/com.android.os.statsd/bin/statsd", // Stats daemon
 // xxx ADD: START
 "/system/bin/gatekeeperd",
 "/system/bin/keystore"
 // END

 };

 public static final List<String> HAL_INTERFACES_OF_INTEREST = Arrays.asList(
 "android.hardware.audio@2.0::IDevicesFactory",
 "android.hardware.audio@4.0::IDevicesFactory",
 "android.hardware.audio@5.0::IDevicesFactory",
 "android.hardware.audio@6.0::IDevicesFactory",
 "android.hardware.biometrics.face@1.0::IBiometricsFace",
 "android.hardware.biometrics.fingerprint@2.1::IBiometricsFingerprint",
 "android.hardware.bluetooth@1.0::IBluetoothHci",
 "android.hardware.camera.provider@2.4::ICameraProvider",
 "android.hardware.gnss@1.0::IGnss",
 "android.hardware.graphics.allocator@2.0::IAllocator",
 "android.hardware.graphics.composer@2.1::IComposer",
 "android.hardware.health@2.0::IHealth",
 "android.hardware.light@2.0::ILight",
 "android.hardware.media.c2@1.0::IComponentStore",
 "android.hardware.media.omx@1.0::IOmx",
 "android.hardware.media.omx@1.0::IOmxStore",
 "android.hardware.neuralnetworks@1.0::IDevice",
 "android.hardware.power.stats@1.0::IPowerStats",
 "android.hardware.sensors@1.0::ISensors",
 "android.hardware.sensors@2.0::ISensors",
 "android.hardware.sensors@2.1::ISensors",
 "android.hardware.vr@1.0::IVr",
 "android.system.suspend@1.0::ISystemSuspend",
 // xxx ADD: START
 "android.hardware.keymaster@4.0::IKeymasterDevice",
 "android.hardware.keymaster@3.0::IKeymasterDevice"
 // END

这些进程信息的堆栈信息,将会在watchdog发生时写入到anr文件中以及wdt文件中

五、问题定位

5.1 初步定位

在bugreport当中发生Watchdog的日志,非常明显,要想初步定位问题,可以现在event log中搜索watchoog对应的关键字。

当发生Watchdog检测超时的系统事件时,在Android中会打印一个EventLog:

watchdog: Blocked in handler XXX # 表示HandlerChecker超时了
watchdog: Blocked in monitor XXX # 表示MonitorChecker超时了

Watchdog是运行在system_server中的一个线程,往往发生Watchog的时候,意味着此时system_server已经被kill掉,导致zygote重启。因此着这一阶段也会打印很多system的log。如果bugreprot中搜索到以下关键字,也能说明发生Watchdog。

Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: XXX
Watchdog: XXX
Watchdog: "*** GOODBYE!

当在bugreport中检索到以上信息,就需要了解发生watchdog之前 ,system_server正在干什么,处于一种什么样的状态。因此我们需要知道trace信息,cpu信息,IO信息等等,因此我们可以去/data/anr 目录下找对应时间点的trace文件进行分析。在大多数情况下,引起watchdog的直接原因是system_server中某一处线程阻塞导致。

5.2 实例log中参数含义

5.2.1 bugreprot中system log

Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.wm.WindowManagerService on foreground thread (android.fg)

这里Watchdog告诉我们Monitor Checker超时了, 名为android.fg的线程在WindowManagerService的monitor()方法被阻塞了。打开对应时间点的anr trace寻找android.fg线程

5.2.2 ANR trace 含义

下方trace调用栈表示 android.fg线程此时的状态是blocked,且由Watchdog发起调用monitor(),这是一个Watchdog检查,阻塞已经超时。该线程正在等tid=91的线程所持有的的一把0x126dccb8锁,具体阻塞原因我们需要查看tid=91线程的堆栈情况进行合理分析。

"android.fg" prio=5 tid=25 Blocked
 | group="main" sCount=1 dsCount=0 obj=0x12eef900 self=0x7f7a8b1000
 | sysTid=973 nice=0 cgrp=default sched=0/0 handle=0x7f644e9000
 | state=S schedstat=( 3181688530 2206454929 8991 ) utm=251 stm=67 core=1 HZ=100
 | stack=0x7f643e7000-0x7f643e9000 stackSize=1036KB
 | held mutexes=
 at com.android.server.wm.WindowManagerService.monitor(WindowManagerService.java:13125)
 - waiting to lock <0x126dccb8> (a java.util.HashMap) held by thread 91
 at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:204)
 at android.os.Handler.handleCallback(Handler.java:815)
 at android.os.Handler.dispatchMessage(Handler.java:104)
 at android.os.Looper.loop(Looper.java:194)
 at android.os.HandlerThread.run(HandlerThread.java:61)
 at com.android.server.ServiceThread.run(ServiceThread.java:46)

[ANR trace]参数含义

mainmain标识是主线程,如果是线程,那么命名成“Thread-X”的格式,x表示线程id,逐步递增。
mainmain标识是主线程,如果是线程,那么命名成“Thread-X”的格式,x表示线程id,逐步递增。
prio线程优先级,默认是5
tidtid不是线程的id,是线程唯一标识ID
group线程组名称
sCount该线程被挂起的次数
dsCount线程被调试器挂起的次数
obj对象地址
self线程Native的地址
sysTid线程号(主线程的线程号和进程号相同)
nice线程的调度优先级
sched分别标志了线程的调度策略和优先级
cgrp调度归属组
handle线程处理函数的地址
state调度状态
schedstat从/proc/[pid]/task/[tid]/schedstat读出,三个值分别表示线程在cpu上执行的时间、线程的等待时间和线程执行的时间片长度,不支持这项信息的三个值都是0
utm线程用户态下使用的时间值(单位是jiffies)
stm内核态下的调度时间值
core最后执行这个线程的cpu核的序号

5.3 分析流程

根据watchdog产生后,抓取的bugreport中确定watchdog发生的时间,再去找相应的时间段的anrtrace日志

  • 思考为什么会发生watchdog事件?在日志中找到发生watchdog的对应时间、对应block的线程
  • 分析线程的堆栈信息,确认线程阻塞的根本原因(是否出现死锁、binder线程池是否满了,binder通信阻塞等)根据具体堆栈具体分析。