A Study of How the WatchDog Mechanism Works


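What is WDT

WDT (Watch Dog Timer) originated on microcontrollers as a safeguard against programs "running away" (executing out of control): the watchdog checks the system at regular intervals and reboots it when a fault is detected.

This article covers the software-level Watchdog in Android. It periodically checks important services such as AMS and WMS. When one of them fails (a deadlock, or a message queue that is no longer being polled normally), it kills the system_server process so that the system restarts, and it leaves stack traces behind so that developers can analyze the cause of the restart.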

The WatchDog mechanism

Question 1: How is the Watchdog started, and what is it?

Watchdog is essentially a thread, so its core job is to execute the body of its run() method, which is covered later.

    public class Watchdog extends Thread { ...

It is started while SystemServer is starting up. As the code in frameworks/base/services/java/com/android/server/SystemServer.java shows, the watchdog thread is the first thing to be started:

    private void startBootstrapServices() {
        // Start the watchdog as early as possible so we can crash the system server
        // if we deadlock during early boot
        traceBeginAndSlog("StartWatchdog");
        final Watchdog watchdog = Watchdog.getInstance(); // get the instance
        watchdog.start(); // start the thread
        traceEnd();
        ...

So the watchdog is started at boot just like the other services, and from then on it keeps executing the body of run().

Question 2: Why can services such as AMS and WMS be monitored?

These services are the objects that Watchdog monitors. Take AMS as an example.

First, the ActivityManagerService class itself implements Watchdog.Monitor:

    public class ActivityManagerService extends IActivityManager.Stub
            implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback {

Next, the AMS constructor registers the service with the watchdog:

    public ActivityManagerService(Context systemContext, ActivityTaskManagerService atm) {
        ...
        Watchdog.getInstance().addMonitor(this);
        Watchdog.getInstance().addThread(mHandler);

Finally, AMS overrides the monitor() method declared by Watchdog.Monitor. This method addresses the watchdog's main concern: whether the AMS lock is deadlocked. Its exact role is explained later.

    /** In this method we try to acquire our lock to make sure that we have not deadlocked */
    public void monitor() {
        synchronized (this) { }
    }

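So AMS joins the monitoring in three key steps:

1. Implement the interface (which declares that the class supports being monitored).

2. Get the Watchdog instance and add a Monitor (the class itself) and a Thread (the handler of the current looper).

3. Implement the monitor() method so that it acquires the lock.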
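The three steps can be sketched in plain Java without any Android classes. This is only an illustration: LockMonitor, MiniWatchdog, DemoService and probeAll are hypothetical names standing in for Watchdog.Monitor, Watchdog and a monitored service, and the addThread half of step 2 is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Watchdog.Monitor.
interface LockMonitor {
    void monitor();
}

// Hypothetical, heavily simplified registry (not the real Watchdog class).
final class MiniWatchdog {
    private static final MiniWatchdog INSTANCE = new MiniWatchdog();
    private final List<LockMonitor> monitors = new ArrayList<>();

    private MiniWatchdog() { }

    static MiniWatchdog getInstance() {
        return INSTANCE;
    }

    synchronized void addMonitor(LockMonitor monitor) {
        monitors.add(monitor);
    }

    // Calls monitor() on every registered object and returns how many calls returned.
    int probeAll() {
        List<LockMonitor> snapshot;
        synchronized (this) {
            snapshot = new ArrayList<>(monitors);
        }
        for (LockMonitor m : snapshot) {
            m.monitor();
        }
        return snapshot.size();
    }
}

// Step 1: implement the interface.
final class DemoService implements LockMonitor {
    DemoService() {
        // Step 2: register with the watchdog singleton.
        MiniWatchdog.getInstance().addMonitor(this);
    }

    // Step 3: acquire and immediately release this object's lock.
    @Override
    public void monitor() {
        synchronized (this) { }
    }
}
```

Creating a DemoService and then calling MiniWatchdog.getInstance().probeAll() returns promptly as long as no other thread is holding the service's lock.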

Question 3: Why does a monitored service add both a Monitor and a Thread?

Two kinds of monitoring:

Kind 1: Monitor Checker

addMonitor adds a Monitor Checker, which uses the monitor() callback to detect whether a critical section of the service is deadlocked or blocked.

The monitor object is added to the list mMonitorQueue:

    private final ArrayList<Monitor> mMonitorQueue = new ArrayList<Monitor>();

    public void addMonitor(Monitor monitor) {
        synchronized (this) {
            mMonitorChecker.addMonitorLocked(monitor);
        }
    }
    ...
    void addMonitorLocked(Monitor monitor) {
        // We don't want to update mMonitors when the Handler is in the middle of checking
        // all monitors. We will update mMonitors on the next schedule if it is safe
        mMonitorQueue.add(monitor);
    }

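What is a monitor?

In Watchdog it is just an interface. If the lock is held by another thread for a long time, the call to monitor() stays blocked waiting for that lock until the watchdog times out.

    public interface Monitor {
        void monitor();
    }

AMS already implements this interface, so the this passed in Watchdog.getInstance().addMonitor(this) is the AMS object together with its monitor() implementation, which simply acquires the lock of the AMS instance:

    public void monitor() {
        synchronized (this) { }
    }

So addMonitor registers a Monitor Checker that probes the lock of the monitored object. If the lock is not released for a long time, the watchdog treats it as a deadlock and WDT is triggered.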
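To make the blocking behaviour concrete, here is a small plain-Java sketch of the idea. It is not Watchdog code: LockProbeDemo, probe and probeWhileHeld are hypothetical names, and a CountDownLatch with a timeout stands in for the watchdog's own timing.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class LockProbeDemo {
    // Returns true if a monitor()-style probe of `lock` finishes within timeoutMs.
    static boolean probe(Object lock, long timeoutMs) {
        CountDownLatch done = new CountDownLatch(1);
        Thread prober = new Thread(() -> {
            synchronized (lock) { }   // same shape as the AMS monitor() body
            done.countDown();
        }, "probe");
        prober.setDaemon(true);
        prober.start();
        return await(done, timeoutMs);
    }

    // Starts a helper thread that holds a fresh lock for holdMs,
    // and probes that lock with timeoutMs while it is held.
    static boolean probeWhileHeld(long holdMs, long timeoutMs) {
        Object lock = new Object();
        CountDownLatch held = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (lock) {
                held.countDown();
                try {
                    Thread.sleep(holdMs);
                } catch (InterruptedException ignored) {
                    // interrupted: release the lock early
                }
            }
        }, "holder");
        holder.setDaemon(true);
        holder.start();
        await(held, 5_000);           // wait until the helper owns the lock
        return probe(lock, timeoutMs);
    }

    private static boolean await(CountDownLatch latch, long timeoutMs) {
        try {
            return latch.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Probing a free lock completes almost immediately, while probing a lock that another thread holds for longer than the timeout reports failure, which is the condition the Monitor Checker is designed to catch.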

Kind 2: Looper Checker

Why add a handler?

A handler processes a thread's message queue. addThread registers a handler so that the watchdog can post a message to it and check whether the service's thread is blocked, that is, whether the thread's message queue is still being processed.

The HandlerChecker object is added to mHandlerCheckers:

    ArrayList<HandlerChecker> mHandlerCheckers = new ArrayList<>();

    public void addThread(Handler thread) {
        addThread(thread, DEFAULT_TIMEOUT);
    }

    public void addThread(Handler thread, long timeoutMillis) {
        synchronized (this) {
            final String name = thread.getLooper().getThread().getName();
            // store the handler, the thread name and the timeout as a HandlerChecker
            mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis));
        }
    }

mHandlerCheckers is also populated with several important threads when the Watchdog is constructed:

    private Watchdog() {
        super("watchdog");
        // Initialize handler checkers for each common thread we want to check.  Note
        // that we are not currently checking the background thread, since it can
        // potentially hold longer running operations with no guarantees about the timeliness
        // of operations there.

        // The shared foreground thread is the main checker.  It is where we
        // will also dispatch monitor checks and do other work.
        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                "foreground thread", DEFAULT_TIMEOUT);
        mHandlerCheckers.add(mMonitorChecker);
        // Add checker for main thread.  We only do a quick check since there
        // can be UI running on the thread.
        mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
                "main thread", DEFAULT_TIMEOUT));
        // Add checker for shared UI thread.
        mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
                "ui thread", DEFAULT_TIMEOUT));
        // And also check IO thread.
        mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
                "i/o thread", DEFAULT_TIMEOUT));
        // And the display thread.
        mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                "display thread", DEFAULT_TIMEOUT));
        // And the animation thread.
        mHandlerCheckers.add(new HandlerChecker(AnimationThread.getHandler(),
                "animation thread", DEFAULT_TIMEOUT));
        // And the surface animation thread.
        mHandlerCheckers.add(new HandlerChecker(SurfaceAnimationThread.getHandler(),
                "surface animation thread", DEFAULT_TIMEOUT));
        ...

A Handler is bound to the Looper that drives a thread's message loop, so holding the Handler gives the watchdog access to that thread's state, namely whether it is still polling for events normally. The way the handler object is used makes this clear.

First, the constructor receives the handler, creates the HandlerChecker object and stores the handler in the member field mHandler:

    HandlerChecker(Handler handler, String name, long waitMaxMillis) {
        mHandler = handler;
        mName = name;
        mWaitMax = waitMaxMillis;
        mCompleted = true;
    }

Use 1: decide whether the handler is stuck. If the handler's queue is currently polling for more work, the loop is still alive rather than stuck, so mCompleted is set to true:

    public void scheduleCheckLocked() {
        ...
        if ((mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling())
                || (mPauseCount > 0)) {
            mCompleted = true;
            return;
        }
        ...
    }

Use 2: post a message to the handler and see whether it is handled promptly. If the queue is not blocked, the run method of HandlerChecker is triggered quickly; otherwise the thread has been blocked for a long time, which triggers WDT.

    // Watchdog.run(), abbreviated
    public void run() {
        for (int i=0; i<mHandlerCheckers.size(); i++) {
            HandlerChecker hc = mHandlerCheckers.get(i);
            hc.scheduleCheckLocked();
        }
    }

    public void scheduleCheckLocked() {
        ...
        mHandler.postAtFrontOfQueue(this);
        ...
    }

So addThread registers a Looper Checker that monitors the message queue of the watched thread. If the thread does not process the posted message within its timeout, the queue is considered blocked and no longer working, and WDT is triggered.
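The Looper Checker idea can be tried outside Android with a toy message loop. The sketch below is an assumption-laden stand-in: MiniLooper, isResponsive and blockFor are hypothetical names, and a LinkedBlockingDeque drained by one daemon thread plays the role of the Looper and its MessageQueue.

```java
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a Looper thread: it runs queued tasks one at a time.
final class MiniLooper {
    private final BlockingDeque<Runnable> queue = new LinkedBlockingDeque<>();

    MiniLooper(String name) {
        Thread loop = new Thread(() -> {
            try {
                while (true) {
                    queue.takeFirst().run();
                }
            } catch (InterruptedException e) {
                // interrupted: leave the loop
            }
        }, name);
        loop.setDaemon(true);
        loop.start();
    }

    void post(Runnable task) {
        queue.addLast(task);
    }

    void postAtFrontOfQueue(Runnable task) {
        queue.addFirst(task);
    }

    // Looper check: true if a probe posted at the front runs within timeoutMs.
    boolean isResponsive(long timeoutMs) {
        CountDownLatch ran = new CountDownLatch(1);
        postAtFrontOfQueue(ran::countDown);
        try {
            return ran.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Demo helper: occupies the loop thread for ms and returns once it is busy.
    void blockFor(long ms) {
        CountDownLatch started = new CountDownLatch(1);
        post(() -> {
            started.countDown();
            try {
                Thread.sleep(ms);
            } catch (InterruptedException ignored) {
                // interrupted: stop blocking early
            }
        });
        try {
            started.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Posting the probe at the front of the queue means it only waits for the task that is currently running, not for everything already queued, so a late probe points at one long-running or stuck task rather than at a merely busy queue.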

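Question 4: What does the WatchDog monitoring process look like?

I. Run logic

1. The watchdog thread is started during SystemServer startup and begins executing its run method, as described above.

2. Inside an infinite while loop, it calls scheduleCheckLocked() on every HandlerChecker to check the lock and message-queue state of the monitored objects.

3. If the message queue is not blocked, the run method of HandlerChecker is triggered quickly; whether the message was blocked determines the state (the blocked-queue check, Looper Checker).

4. The run method then tries to acquire the object lock of each registered monitor; whether a lock is deadlocked determines the state (the deadlock check, Monitor Checker).

5. After 30s, check the completion state of each HandlerChecker:

COMPLETED: finished
WAITING: still waiting, not timed out, within CHECK_INTERVAL (30s)
WAITED_HALF: more than half the wait has passed, not timed out, between CHECK_INTERVAL and OVERDUE (30s to 60s)
OVERDUE: timed out, 60s by default (DEFAULT_TIMEOUT)

6. If no checker has timed out, leave the current iteration and repeat from step 2 (one check every 30s).

7. On timeout, collect the blocked checkers through getBlockedCheckersLocked and generate a description.

8. Print logs, save the stack traces under /data/anr and in DropBox (/data/system/dropbox), and upload them to MQS.

9. Kill the system_server process, which restarts the phone.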
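The nine steps can be condensed into a small plain-Java sketch. It is a simplification under several assumptions: WatchdogLoopSketch, Checker and watch are hypothetical names, each probe runs on its own daemon thread instead of being posted to the monitored thread's Handler, the intervals are milliseconds rather than 30s and 60s, and the method returns the name of the overdue checker where the real code would dump traces and kill the process.

```java
import java.util.List;

final class WatchdogLoopSketch {
    static final int COMPLETED = 0, WAITING = 1, WAITED_HALF = 2, OVERDUE = 3;

    // Hypothetical checker. The probe runs on a fresh daemon thread here; the
    // real HandlerChecker is posted to the monitored thread's Handler instead.
    static final class Checker {
        final String name;
        private final Runnable probe;
        private final long waitMaxMs;
        private boolean completed = true;
        private long startMs;

        Checker(String name, Runnable probe, long waitMaxMs) {
            this.name = name;
            this.probe = probe;
            this.waitMaxMs = waitMaxMs;
        }

        synchronized void scheduleCheck() {
            if (!completed) {
                return; // previous probe still pending: keep its start time
            }
            completed = false;
            startMs = nowMs();
            Thread t = new Thread(() -> {
                probe.run();
                synchronized (Checker.this) {
                    completed = true;
                }
            }, name);
            t.setDaemon(true);
            t.start();
        }

        synchronized int state() {
            if (completed) return COMPLETED;
            long latency = nowMs() - startMs;
            if (latency < waitMaxMs / 2) return WAITING;
            if (latency < waitMaxMs) return WAITED_HALF;
            return OVERDUE;
        }
    }

    static long nowMs() {
        return System.nanoTime() / 1_000_000;
    }

    static void sleep(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Runs at most maxRounds check rounds. Returns the name of the first overdue
    // checker, or null if every round ended without a timeout.
    static String watch(List<Checker> checkers, long intervalMs, int maxRounds) {
        for (int round = 0; round < maxRounds; round++) {
            for (Checker c : checkers) c.scheduleCheck();   // step 2
            sleep(intervalMs);                              // wait one check interval
            for (Checker c : checkers) {
                if (c.state() == OVERDUE) return c.name;    // steps 7 to 9 would start here
            }
        }
        return null;
    }
}
```

A checker whose probe returns immediately never reaches OVERDUE, while a checker whose probe sleeps far longer than waitMaxMs is reported after a few rounds.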

II. Key steps in detail

Step 2: get every HandlerChecker to be checked, call scheduleCheckLocked on each of them to run the check, then schedule again after 30s

    public void run() {
        boolean waitedHalf = false;
        while (true) { // infinite loop
            ...
            synchronized (this) {
                long timeout = CHECK_INTERVAL; // 30s
                ...
                // 1. Schedule every HandlerChecker and record the start time
                for (int i=0; i<mHandlerCheckers.size(); i++) {
                    HandlerChecker hc = mHandlerCheckers.get(i);
                    hc.scheduleCheckLocked();
                }
                // 2. Start the periodic timeout check
                // Timing uses uptimeMillis, which excludes time spent asleep (system services
                // sleep along with the phone and cannot respond to the watchdog's messages),
                // to avoid killing the process by mistake
                long start = SystemClock.uptimeMillis();
                while (timeout > 0) {
                    ...
                    try {
                        // sleep for 30s
                        wait(timeout);
                    } catch (InterruptedException e) {
                        Log.wtf(TAG, e);
                    }
                    ...
                    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
                }
                ...
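The inner while loop recomputes the remaining time after every wait, so a wake-up that arrives early (a notify, an interrupt, or a spurious wake-up) does not shorten the interval. A minimal plain-Java sketch of that pattern follows; IntervalWait and waitFullInterval are hypothetical names, and System.nanoTime() is used here only as a stand-in for SystemClock.uptimeMillis().

```java
final class IntervalWait {
    // Blocks on `lock` until at least intervalMs have elapsed, even if wait()
    // returns early. Returns the elapsed time in milliseconds.
    static long waitFullInterval(Object lock, long intervalMs) {
        long start = System.nanoTime();
        long remaining = intervalMs;
        synchronized (lock) {
            while (remaining > 0) {
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    // The excerpt above only logs here; this sketch simply keeps waiting.
                }
                // Recompute what is left instead of assuming wait() slept the full time.
                remaining = intervalMs - (System.nanoTime() - start) / 1_000_000;
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

If another thread calls notifyAll() on the lock after 50 ms, a call to waitFullInterval(lock, 300) still returns only once 300 ms have passed.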

If there are no monitors to check and the thread is polling normally (or the checker is paused), the checker is marked completed right away and waits for the next round. Otherwise it records the time at which the message is sent and posts the message to the queue:

    public void scheduleCheckLocked() {
        if (mCompleted) { // move all queued monitors into mMonitors
            mMonitors.addAll(mMonitorQueue);
            mMonitorQueue.clear(); // clear the queue
        }
        if ((mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling())
                || (mPauseCount > 0)) {
            // nothing to monitor and the queue is polling normally: mark as completed
            mCompleted = true;
            return;
        }
        if (!mCompleted) {
            // a check is already in flight
            return;
        }
        // The thread is handling some message and may be blocked; post a message with postAtFrontOfQueue
        mCompleted = false; // a message has been sent and is waiting to be handled
        mCurrentMonitor = null;
        mStartTime = SystemClock.uptimeMillis(); // record the current uptime
        mHandler.postAtFrontOfQueue(this); // post to the queue so that HandlerChecker's run method executes
    }

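Step 3: if the message queue is not blocked, postAtFrontOfQueue quickly triggers the run method of HandlerChecker.

run then tries to acquire the lock of each monitored object. If a lock is deadlocked, the call starts waiting; if the wait exceeds the timeout, it is treated as a deadlock, and the status check in step 5 finds a checker that has not completed.

    public void run() {
        // Walk through every monitor in mMonitors, acquiring the lock of each monitored object
        final int size = mMonitors.size();
        for (int i = 0 ; i < size ; i++) {
            synchronized (Watchdog.this) {
                mCurrentMonitor = mMonitors.get(i);
            }
            mCurrentMonitor.monitor(); // acquire the object lock; on deadlock this blocks and mCompleted stays false
        }
        // All locks were acquired without blocking: set mCompleted to true to mark the check as done, no deadlock
        synchronized (Watchdog.this) {
            mCompleted = true;
            mCurrentMonitor = null;
        }
    }

Taking the monitor() override in AMS as an example, mCurrentMonitor.monitor() calls into this AMS method to acquire the lock, and waits if the lock is held:

    public void monitor() {
        synchronized (this) { }
    }

Once every object lock has been acquired without timing out, the state is updated to completed.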

Step 5: check the completion state of the HandlerCheckers

    public void run() {
        ...
        // 3. Check the completion state of the HandlerCheckers
        final int waitState = evaluateCheckerCompletionLocked();
        if (waitState == COMPLETED) { // finished
            ...
            continue;
        } else if (waitState == WAITING) { // still waiting, not timed out (under 30s)
            ...
            continue;
        } else if (waitState == WAITED_HALF) { // still waiting, not timed out (30s to 60s)
            ...
            // 1. Upload to MQS
            // 2. Print the stack trace
            continue;
        }
        ... // timed out: start printing logs, etc.
    }

evaluateCheckerCompletionLocked returns the completion state, which is one of COMPLETED, WAITING, WAITED_HALF and OVERDUE:

    private int evaluateCheckerCompletionLocked() {
        int state = COMPLETED;
        for (int i=0; i<mHandlerCheckers.size(); i++) {
            HandlerChecker hc = mHandlerCheckers.get(i);
            state = Math.max(state, hc.getCompletionStateLocked()); // take the maximum
        }
        return state;
    }
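Taking the maximum works because the state constants are ordered by severity, so the worst state among all checkers becomes the overall verdict. A tiny sketch of that aggregation, assuming the values 0 to 3 for the four states (WorstState and worst are hypothetical names):

```java
final class WorstState {
    // Assumed ordering: a larger value means a worse state.
    static final int COMPLETED = 0, WAITING = 1, WAITED_HALF = 2, OVERDUE = 3;

    // Returns the most severe state among the given checker states.
    static int worst(int... states) {
        int state = COMPLETED;
        for (int s : states) {
            state = Math.max(state, s);
        }
        return state;
    }
}
```

For example, worst(COMPLETED, WAITED_HALF, WAITING) yields WAITED_HALF, and a single OVERDUE checker is enough to make the overall result OVERDUE.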

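getCompletionStateLocked returns COMPLETED if the check has finished. Otherwise it takes the current time minus the start time recorded in scheduleCheckLocked (the moment the message was sent) and compares that latency with the maximum wait:

    public int getCompletionStateLocked() {
        if (mCompleted) { // set to true when HandlerChecker's run method finished without blocking
            return COMPLETED;
        } else {
            long latency = SystemClock.uptimeMillis() - mStartTime; // current time minus the time the message was sent
            if (latency < mWaitMax/2) {
                return WAITING;
            } else if (latency < mWaitMax) {
                return WAITED_HALF;
            }
        }
        return OVERDUE;
    }

The maximum wait mWaitMax is set per checker when it is registered, so different service threads can have different limits; once the elapsed time reaches the maximum wait, the checker counts as timed out. (The run method of HandlerChecker executes on the monitored thread rather than on the watchdog thread, so the check proceeds in parallel with the watchdog's wait.)

For example, the maximum timeout that PKMS registers for its handler thread is ten minutes:

    static final long WATCHDOG_TIMEOUT = 1000*60*10;     // ten minutes
    Watchdog.getInstance().addThread(mHandler, WATCHDOG_TIMEOUT);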
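To see how the same latency maps to different states under different timeouts, the decision can be written as a pure function. This is an illustrative sketch: CompletionState and of are hypothetical names, and an enum replaces the int constants for readability.

```java
final class CompletionState {
    enum State { COMPLETED, WAITING, WAITED_HALF, OVERDUE }

    // Mirrors the branch structure of getCompletionStateLocked for given inputs.
    static State of(boolean completed, long latencyMs, long waitMaxMs) {
        if (completed) {
            return State.COMPLETED;
        }
        if (latencyMs < waitMaxMs / 2) {
            return State.WAITING;
        }
        if (latencyMs < waitMaxMs) {
            return State.WAITED_HALF;
        }
        return State.OVERDUE;
    }
}
```

With the default 60s timeout, a latency of 30s already counts as WAITED_HALF and 60s as OVERDUE. With the ten-minute PKMS timeout, a latency of 60s is still WAITING, and the checker only becomes OVERDUE at 600s.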

Step 7: on timeout, collect the blocked checkers with getBlockedCheckersLocked and print the stack information

    public void run() {
        ...
        // OVERDUE: a timeout occurred, collect the overdue HandlerCheckers
        blockedCheckers = getBlockedCheckersLocked();
        // generate the description
        subject = describeCheckersLocked(blockedCheckers);
        ...

Collect the overdue HandlerCheckers:

    private ArrayList<HandlerChecker> getBlockedCheckersLocked() {
        ArrayList<HandlerChecker> checkers = new ArrayList<HandlerChecker>();
        for (int i=0; i<mHandlerCheckers.size(); i++) {
            HandlerChecker hc = mHandlerCheckers.get(i);
            if (hc.isOverdueLocked()) { // add the overdue HandlerChecker
                checkers.add(hc);
            }
        }
        return checkers;
    }
    ...
    boolean isOverdueLocked() {
        // not completed, and the current time is past start time + maximum wait
        return (!mCompleted) && (SystemClock.uptimeMillis() > mStartTime + mWaitMax);
    }

The overdue HandlerCheckers are then described. As the code shows, a checker is blocked either in the handler, meaning the thread itself is stuck, or in a monitor, meaning it is waiting to acquire the corresponding object lock:

    private String describeCheckersLocked(List<HandlerChecker> checkers) {
        StringBuilder builder = new StringBuilder(128); // build the string with a StringBuilder
        for (int i=0; i<checkers.size(); i++) {
            if (builder.length() > 0) {
                builder.append(", ");
            }
            builder.append(checkers.get(i).describeBlockedStateLocked()); // describe this checker
        }
        return builder.toString(); // the class names or thread names that are currently blocked
    }
    ...
    String describeBlockedStateLocked() {
        if (mCurrentMonitor == null) {
            return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
        } else {
            return "Blocked in monitor " + mCurrentMonitor.getClass().getName() // class name
                    + " on " + mName + " (" + getThread().getName() + ")"; // thread name
        }
    }
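The shape of the resulting subject string is easiest to see with sample data. The sketch below rebuilds the two message formats from the excerpt in plain Java; BlockedReport, Blocked and subject are hypothetical names, and the checker, thread and class names in the usage note are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class BlockedReport {
    // Hypothetical snapshot of one overdue checker. monitorClass is null when the
    // checker is stuck in the handler itself rather than inside a monitor() call.
    static final class Blocked {
        final String checkerName;
        final String threadName;
        final String monitorClass;

        Blocked(String checkerName, String threadName, String monitorClass) {
            this.checkerName = checkerName;
            this.threadName = threadName;
            this.monitorClass = monitorClass;
        }

        String describe() {
            if (monitorClass == null) {
                return "Blocked in handler on " + checkerName + " (" + threadName + ")";
            }
            return "Blocked in monitor " + monitorClass
                    + " on " + checkerName + " (" + threadName + ")";
        }
    }

    // Joins the per-checker descriptions with ", ", like describeCheckersLocked.
    static String subject(List<Blocked> blocked) {
        List<String> parts = new ArrayList<>();
        for (Blocked b : blocked) {
            parts.add(b.describe());
        }
        return String.join(", ", parts);
    }
}
```

Given one checker named "foreground thread" stuck in a monitor of com.example.DemoService on thread fg-demo and one checker named "main thread" stuck in its handler on thread main, subject returns "Blocked in monitor com.example.DemoService on foreground thread (fg-demo), Blocked in handler on main thread (main)".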

Step 8: print logs, store the stack information, and so on

Save to DropBox:

    final File stack = dumpTracesFile(waitedHalf, pids, subject);
    Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
        public void run() {
            // If a watched thread hangs before init() is called, we don't have a
            // valid mActivity. So we can't log the error to dropbox.
            if (mActivity != null) {
                mActivity.addErrorToDropBox(
                        "watchdog", null, "system_server", null, null, null,
                        name.isEmpty() ? subject : name, cpuInfo, stack, null);
            }
            StatsLog.write(StatsLog.SYSTEM_SERVER_WATCHDOG_OCCURRED, subject);
        }
    };
    dropboxThread.start();

Upload to MQS:

    WatchdogInjector.onWatchdog(MQSEvent.EVENT_JWDT, Process.myPid(), subject,
            stack, getBlockedCheckersLocked());

Print the stack information to logcat:

    Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
    WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
    Slog.w(TAG, "*** GOODBYE!");

Information under /data/anr:

    if (mOpenFdMonitor != null) {
        fdLimitTriggered = mOpenFdMonitor.monitor();
    }

Step 9: kill system_server and restart the phone

    public void run() {
        ...
        // send signal 9 (SIGNAL_KILL) to kill the system process
        Process.killProcess(Process.myPid());
        System.exit(10);
        ...

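Troubleshooting

For a device hang, some preparation is needed before the analysis:
(1) Get hold of the device in its failing state and charge it promptly so that the state is not lost;
(2) If the device is still in the failing state (skip this step otherwise), run kill -3 followed by the system_server pid to generate a fresh traces file;
(3) If a fresh traces file cannot be generated, run debuggerd -b followed by the system_server pid to write the native call stacks of all threads to a file;
(4) Use adb to pull all files under /data/anr;
(5) Use adb to pull all files under /data/tombstones;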