Android Watchdog: What Does the Dog Actually Do?


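Preface

If you have some development experience, you have probably heard of Watchdog. So what is Watchdog? Watch Dogs is a game series developed by Ubisoft, which has so far reached Watch Dogs: Legion. Just kidding. What is Watchdog, and why was it designed? Hearing the name may quickly bring deadlocks to mind. It is a service started by SystemServer and is essentially a thread. This time we will analyze, from the source code, what it actually does.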

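Preparation

Of course, some preparation is needed before reading the source, otherwise you may simply not follow it. First, you need to understand the Handler mechanism. You also need the concepts of locks and deadlocks, though I suspect most people only hear about Watchdog after learning about deadlocks. You should at least know what SystemServer is for. Understanding the design idea behind Monitor helps, but not knowing it will not stop you from following the main flow here.

Two classes matter in this source: HandlerChecker and Monitor. In brief, the flow is: use a handler to post a message to the monitored thread and start timing. If the message is handled within 30 seconds, do nothing. If it is not handled within 30 seconds but is handled within 60 seconds, dump logs. If it is still not handled after 60 seconds, blow up.

Main flow source analysis

PS: the source is from API level 29

First, SystemServer creates and starts this thread, or you could say it starts this service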

private void startBootstrapServices() {
    ......
    final Watchdog watchdog = Watchdog.getInstance();
    watchdog.start();
    ......
    watchdog.init(mSystemContext, mActivityManagerService);
    ......
}

It is a singleton, so let's look at the constructor

private Watchdog() {
    super("watchdog");
    mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
            "foreground thread", DEFAULT_TIMEOUT);
    mHandlerCheckers.add(mMonitorChecker);
    // Add checker for main thread.  We only do a quick check since there
    // can be UI running on the thread.
    mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
            "main thread", DEFAULT_TIMEOUT));
    // Add checker for shared UI thread.
    mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
            "ui thread", DEFAULT_TIMEOUT));
    // And also check IO thread.
    mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
            "i/o thread", DEFAULT_TIMEOUT));
    // And the display thread.
    mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
            "display thread", DEFAULT_TIMEOUT));
    // And the animation thread.
    mHandlerCheckers.add(new HandlerChecker(AnimationThread.getHandler(),
            "animation thread", DEFAULT_TIMEOUT));
    // And the surface animation thread.
    mHandlerCheckers.add(new HandlerChecker(SurfaceAnimationThread.getHandler(),
            "surface animation thread", DEFAULT_TIMEOUT));

    // For the main flow, the Binder threads can be ignored for now
    // Initialize monitor for Binder threads.
    addMonitor(new BinderThreadMonitor());

    mOpenFdMonitor = OpenFdMonitor.create();

    // See the notes on DEFAULT_TIMEOUT.
    assert DB ||
            DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;
}

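For the main flow, the Binder threads can be ignored for now, to keep things focused. You can clearly see that the constructor takes the handlers of several important threads, wraps each in a HandlerChecker, and puts them into the list mHandlerCheckers. Think of it simply as creating one object to hold each thread's information, with Watchdog keeping a list of these objects.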

public final class HandlerChecker implements Runnable {
    private final Handler mHandler;
    private final String mName;
    private final long mWaitMax;
    private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
    private final ArrayList<Monitor> mMonitorQueue = new ArrayList<Monitor>();
    private boolean mCompleted;
    private Monitor mCurrentMonitor;
    private long mStartTime;
    private int mPauseCount;

    HandlerChecker(Handler handler, String name, long waitMaxMillis) {
        mHandler = handler;
        mName = name;
        mWaitMax = waitMaxMillis;
        mCompleted = true;
    }
    
    ......
}

Next, let's look at the init method

public void init(Context context, ActivityManagerService activity) {
    mActivity = activity;
    context.registerReceiver(new RebootRequestReceiver(),
            new IntentFilter(Intent.ACTION_REBOOT),
            android.Manifest.permission.REBOOT, null);
}

final class RebootRequestReceiver extends BroadcastReceiver {
    @Override
    public void onReceive(Context c, Intent intent) {
        if (intent.getIntExtra("nowait", 0) != 0) {
            rebootSystem("Received ACTION_REBOOT broadcast");
            return;
        }
        Slog.w(TAG, "Unsupported ACTION_REBOOT broadcast: " + intent);
    }
}

void rebootSystem(String reason) {
    Slog.i(TAG, "Rebooting system because: " + reason);
    IPowerManager pms = (IPowerManager)ServiceManager.getService(Context.POWER_SERVICE);
    try {
        pms.reboot(false, reason, false);
    } catch (RemoteException ex) {
    }
}

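This is clearly the reboot path: it registers a broadcast receiver and reboots when that broadcast arrives. It is not part of the main flow, so a quick look is enough.

Here comes the key part, the main flow. Watchdog extends Thread, so the start call above ends up in the run method here. Off we run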

@Override
public void run() {
    boolean waitedHalf = false;
    while (true) {
        ......
        synchronized (this) {
            long timeout = CHECK_INTERVAL;
            for (int i=0; i<mHandlerCheckers.size(); i++) {
                HandlerChecker hc = mHandlerCheckers.get(i);
                hc.scheduleCheckLocked();
            }

            ......
            long start = SystemClock.uptimeMillis();
            while (timeout > 0) {
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                try {
                    wait(timeout);
                } catch (InterruptedException e) {
                    Log.wtf(TAG, e);
                }
                ......
                timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
            }

            ......

            if (!fdLimitTriggered) {
                // For now, just treat this as the branch taken in the normal case
                final int waitState = evaluateCheckerCompletionLocked();
                if (waitState == COMPLETED) {
                    // The monitors have returned; reset
                    waitedHalf = false;
                    continue;
                } else if (waitState == WAITING) {
                    // still waiting but within their configured intervals; back off and recheck
                    continue;
                } else if (waitState == WAITED_HALF) {
                    if (!waitedHalf) {
                        Slog.i(TAG, "WAITED_HALF");
                        // We've waited half the deadlock-detection interval.  Pull a stack
                        // trace and wait another half.
                        ArrayList<Integer> pids = new ArrayList<Integer>();
                        pids.add(Process.myPid());
                        ActivityManagerService.dumpStackTraces(pids, null, null,
                            getInterestingNativePids());
                        waitedHalf = true;
                    }
                    continue;
                }

                // something is overdue!
                blockedCheckers = getBlockedCheckersLocked();
                subject = describeCheckersLocked(blockedCheckers);
            } else {
                ......
            }
            ......
        }

        // Collect logs, then exit
        ......

        waitedHalf = false;
    }
}

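I have elided some of the code so that it reads more comfortably, mainly for fear that too much code would scare people off.

First there is an infinite loop, which iterates over mHandlerCheckers, the list of HandlerChecker objects created in the constructor, and calls scheduleCheckLocked on each HandlerChecker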

public void scheduleCheckLocked() {
    if (mCompleted) {
        // Safe to update monitors in queue, Handler is not in the middle of work
        mMonitors.addAll(mMonitorQueue);
        mMonitorQueue.clear();
    }
    if ((mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling())
            || (mPauseCount > 0)) {
        mCompleted = true;
        return;
    }
    if (!mCompleted) {
        // we already have a check in flight, so no need
        return;
    }

    mCompleted = false;
    mCurrentMonitor = null;
    mStartTime = SystemClock.uptimeMillis();
    mHandler.postAtFrontOfQueue(this);
}

HandlerChecker holds a list of Monitor objects. Monitor is an interface; outside classes implement it by providing a monitor method. More on this later.

public interface Monitor {
    void monitor();
}

mCompleted defaults to true

if (mCompleted) {
    // Safe to update monitors in queue, Handler is not in the middle of work
    mMonitors.addAll(mMonitorQueue);
    mMonitorQueue.clear();
}

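This moves the elements of mMonitorQueue into mMonitors. What does that mean? It is a little hard to explain, so think of it this way: Watchdog's run method is an infinite loop that keeps calling scheduleCheckLocked, and the checking logic works on mMonitors. New elements must not be added while that logic is using the list, or things would get messy. So a newly added Monitor can only be merged in at the start of a cycle, when no check is in progress. That is what this code does.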
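To make the staging idea concrete, here is a minimal sketch in plain Java. It is not the AOSP code: the class StagedMonitors and its methods are made-up names, and the early-return branches of scheduleCheckLocked are left out on purpose. It only shows that a monitor added while a check is in flight stays in the pending list until the next safe point.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the monitor bookkeeping inside HandlerChecker.
class StagedMonitors {
    private final List<Runnable> active = new ArrayList<>();   // plays the role of mMonitors
    private final List<Runnable> pending = new ArrayList<>();  // plays the role of mMonitorQueue
    private boolean completed = true;                          // plays the role of mCompleted

    // New monitors are only ever staged, never added to the active list directly.
    synchronized void add(Runnable monitor) {
        pending.add(monitor);
    }

    // Called once per cycle; staged monitors are merged only when no check is in flight.
    synchronized void schedule() {
        if (completed) {
            active.addAll(pending);
            pending.clear();
        }
        completed = false; // a check is now considered in flight
    }

    // Called when the in-flight check has finished.
    synchronized void finish() {
        completed = true;
    }

    synchronized int activeCount() {
        return active.size();
    }

    public static void main(String[] args) {
        StagedMonitors m = new StagedMonitors();
        m.add(() -> {});
        m.schedule();                         // merges the first monitor
        m.add(() -> {});                      // arrives while the check is running
        m.schedule();                         // still in flight, nothing is merged
        System.out.println(m.activeCount());  // prints 1
        m.finish();
        m.schedule();                         // safe point, second monitor is merged
        System.out.println(m.activeCount());  // prints 2
    }
}
```

The point of the two lists is that the list being iterated by the check is never modified while the check is running.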

if ((mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling())
        || (mPauseCount > 0)) {
    mCompleted = true;
    return;
}

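If mMonitors is empty and this handler's MessageQueue is polling (isPolling returns true when the looper is idle and waiting for the next message, which means it is not stuck inside one), or if the checker has been paused (mPauseCount > 0), mCompleted is set to true and the method returns immediately. Why? The goal is to decide whether this thread is stuck, and a looper that is idly polling its queue is clearly not stuck, so with no monitors to run there is nothing left to check. If this part is unclear, revisit the Handler mechanism.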

If that is not the case, we continue

// Ignore this for now; mark this spot as point A1
if (!mCompleted) {
    // we already have a check in flight, so no need
    return;
}

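You can skip this part for now. As seen above, mCompleted is true at this point, so we continue. Mark this spot as point A1; the flow will come back to it later.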

mCompleted = false;
mCurrentMonitor = null;
mStartTime = SystemClock.uptimeMillis();
mHandler.postAtFrontOfQueue(this);

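mCompleted is set to false, mStartTime records the current time as the starting point for the whole check, and the handler posts a message with postAtFrontOfQueue. Because this is passed in, the HandlerChecker's own run method will be invoked.

Now for a test of your fundamentals: which thread does this run method execute on?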

@Override
public void run() {
    final int size = mMonitors.size();
    for (int i = 0 ; i < size ; i++) {
        synchronized (Watchdog.this) {
            mCurrentMonitor = mMonitors.get(i);
        }
        mCurrentMonitor.monitor();
    }

    synchronized (Watchdog.this) {
        mCompleted = true;
        mCurrentMonitor = null;
    }
}

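This iterates over mMonitors and calls monitor on each one, which is in fact the deadlock check. For now, simply think of it this way: if a deadlock has occurred, mCurrentMonitor.monitor() gets stuck here and never reaches mCompleted = true;

Posting through the handler means run executes on the thread that owns the handler's Looper, not on the Watchdog thread, so the Watchdog thread carries on. Back to Watchdog's run method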
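For anyone unsure about the thread question, here is a rough analogy in plain Java. Android's Handler and Looper are not available outside the platform, so a single-thread executor stands in for them here; that substitution, and the names PostedRunnableThread and fake-ui-thread, are assumptions for illustration only. The task is posted from one thread and reports the name of the thread that actually ran it.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class PostedRunnableThread {
    // Posts a task to a single worker thread and returns the name of the thread that ran it.
    static String runOnWorker(String workerName) {
        // The single worker plays the role of the thread that owns the Looper.
        ExecutorService worker =
                Executors.newSingleThreadExecutor(r -> new Thread(r, workerName));
        try {
            Callable<String> task = () -> Thread.currentThread().getName();
            return worker.submit(task).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            worker.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("posted from: " + Thread.currentThread().getName());
        System.out.println("ran on: " + runOnWorker("fake-ui-thread"));
    }
}
```

The poster only enqueues the work. The body runs on the worker, which is why HandlerChecker.run can hang without hanging the Watchdog thread itself.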

long start = SystemClock.uptimeMillis();
while (timeout > 0) {
    if (Debug.isDebuggerConnected()) {
        debuggerWasConnected = 2;
    }
    try {
        wait(timeout);
        // Note: mHandlerCheckers and mMonitorChecker may have changed after waiting
    } catch (InterruptedException e) {
        Log.wtf(TAG, e);
    }
    if (Debug.isDebuggerConnected()) {
        debuggerWasConnected = 2;
    }
    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
}

wait(timeout); blocks the thread, whose state becomes TIMED_WAITING. Here timeout is CHECK_INTERVAL, which is 30 seconds.

After 30 seconds the flow enters this part
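The thread state can be observed directly. The sketch below is plain Java with made-up names (TimedWaitState, fake-watchdog) and a short timeout in place of the real 30 seconds. It starts a thread that calls wait with a timeout and reads that thread's state from outside.

```java
class TimedWaitState {
    // Starts a thread that calls wait(timeoutMillis) and returns the state seen from outside.
    static Thread.State observeWaitingState(long timeoutMillis) {
        final Object lock = new Object();
        Thread sleeper = new Thread(() -> {
            synchronized (lock) {
                try {
                    lock.wait(timeoutMillis); // same call shape as wait(timeout) in Watchdog.run
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "fake-watchdog");
        sleeper.setDaemon(true);
        sleeper.start();
        try {
            // Give the thread a moment to reach wait(); stop polling halfway through the timeout.
            long deadline = System.nanoTime() + timeoutMillis * 1_000_000L / 2;
            while (sleeper.getState() != Thread.State.TIMED_WAITING
                    && System.nanoTime() < deadline) {
                Thread.sleep(1);
            }
            return sleeper.getState();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return sleeper.getState();
        } finally {
            sleeper.interrupt(); // wake the sleeper early so the demo ends quickly
        }
    }

    public static void main(String[] args) {
        System.out.println(observeWaitingState(2000));
    }
}
```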

final int waitState = evaluateCheckerCompletionLocked();
if (waitState == COMPLETED) {
    // The monitors have returned; reset
    waitedHalf = false;
    continue;
} else if (waitState == WAITING) {
    // still waiting but within their configured intervals; back off and recheck
    continue;
} else if (waitState == WAITED_HALF) {
    if (!waitedHalf) {
        Slog.i(TAG, "WAITED_HALF");
        // We've waited half the deadlock-detection interval.  Pull a stack
        // trace and wait another half.
        ArrayList<Integer> pids = new ArrayList<Integer>();
        pids.add(Process.myPid());
        ActivityManagerService.dumpStackTraces(pids, null, null,
            getInterestingNativePids());
        waitedHalf = true;
    }
    continue;
}

private int evaluateCheckerCompletionLocked() {
    int state = COMPLETED;
    for (int i=0; i<mHandlerCheckers.size(); i++) {
        HandlerChecker hc = mHandlerCheckers.get(i);
        state = Math.max(state, hc.getCompletionStateLocked());
    }
    return state;
}

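evaluateCheckerCompletionLocked loops over the HandlerCheckers, calls getCompletionStateLocked on each, and returns one final state based on all of them. I will explain the states later. First look at getCompletionStateLocked (it is worth thinking about which thread this method runs on)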

public int getCompletionStateLocked() {
    if (mCompleted) {
        return COMPLETED;
    } else {
        long latency = SystemClock.uptimeMillis() - mStartTime;
        if (latency < mWaitMax/2) {
            return WAITING;
        } else if (latency < mWaitMax) {
            return WAITED_HALF;
        }
    }
    return OVERDUE;
}

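HandlerChecker's getCompletionStateLocked is the counterpart of scheduleCheckLocked.

If mCompleted is true, it returns COMPLETED, which is the normal state. As seen above, mCompleted is true in the normal case and is false only while that thread is still stuck. What does "that thread is still stuck" mean? After postAtFrontOfQueue in scheduleCheckLocked, there are two ways to get stuck.

(1) The Message that the Handler's MessageQueue was already processing never finishes, so the runnable posted with postAtFrontOfQueue has still not reached run after these 30 seconds
(2) mCurrentMonitor.monitor() inside run stays stuck and is still stuck after 30 seconds; more precisely, the thread is BLOCKED while contending for a lock and never reaches mCompleted = true

In both cases mCompleted is false, and latency measures the elapsed time. With the default 60-second mWaitMax: less than 30 seconds returns WAITING, at least 30 but less than 60 seconds returns WAITED_HALF, and 60 seconds or more returns OVERDUE.

Back in evaluateCheckerCompletionLocked, the line state = Math.max(state, hc.getCompletionStateLocked()); exists because several threads are being checked. If even one of them is abnormal, the method returns the state of the worst one.

If it returns COMPLETED, this cycle was normal and the next cycle begins. If it returns WAITING (the same applies to WAITED_HALF), then when the next cycle reaches HandlerChecker's scheduleCheckLocked, it takes the branch at point A1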
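The two steps, classifying one checker and then taking the worst state across all checkers, can be written as two small pure functions. This is a sketch that mirrors the logic quoted above rather than the AOSP code itself. The names CheckerStates, classify and worstOf are mine, and the time values are passed in instead of being read from SystemClock.

```java
class CheckerStates {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    // Same decision order as getCompletionStateLocked, with the inputs made explicit.
    static int classify(boolean completed, long latencyMillis, long waitMaxMillis) {
        if (completed) return COMPLETED;
        if (latencyMillis < waitMaxMillis / 2) return WAITING;
        if (latencyMillis < waitMaxMillis) return WAITED_HALF;
        return OVERDUE;
    }

    // Same idea as evaluateCheckerCompletionLocked: the largest value is the worst state.
    static int worstOf(int... states) {
        int worst = COMPLETED;
        for (int s : states) {
            worst = Math.max(worst, s);
        }
        return worst;
    }

    public static void main(String[] args) {
        int ui = classify(true, 0, 60_000);         // finished its check
        int io = classify(false, 31_000, 60_000);   // stuck for 31 s
        int anim = classify(false, 5_000, 60_000);  // stuck for 5 s
        System.out.println(worstOf(ui, io, anim));  // prints 2 (WAITED_HALF)
    }
}
```

This works only because the constants are ordered from best to worst, so Math.max doubles as "pick the most severe".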

if (!mCompleted) {
    // we already have a check in flight, so no need
    return;
}

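In that case there is no need to post the message again or reset the start time. When it returns WAITED_HALF, dumpStackTraces is called to collect information. When it returns OVERDUE, the information is collected and the process is killed so the system restarts. Below is the source for collecting information and restarting; skip it if you prefer.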
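Why the A1 guard matters is easiest to see with numbers. Below is a small sketch with hypothetical names (InFlightGuard, schedule, latency) that keeps only the mCompleted and mStartTime bookkeeping and takes the clock as a parameter. Because the second schedule call is ignored, the latency keeps growing from the original start time and can eventually cross the 60-second mark.

```java
class InFlightGuard {
    private boolean completed = true; // plays the role of mCompleted
    private long startTime;           // plays the role of mStartTime
    private int posts;                // how many times a check was actually posted

    // Returns true only when a new check was started.
    boolean schedule(long nowMillis) {
        if (!completed) {
            return false; // point A1: a check is already in flight, keep the old start time
        }
        completed = false;
        startTime = nowMillis;
        posts++;
        return true;
    }

    void complete() {
        completed = true;
    }

    long latency(long nowMillis) {
        return nowMillis - startTime;
    }

    int posts() {
        return posts;
    }

    public static void main(String[] args) {
        InFlightGuard g = new InFlightGuard();
        System.out.println(g.schedule(0));       // true, the check is posted at t = 0
        System.out.println(g.schedule(30_000));  // false, still in flight at t = 30 s
        System.out.println(g.latency(60_000));   // 60000, measured from the first post
    }
}
```

If the start time were reset on every cycle, a stuck thread would never look more than 30 seconds late.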


......

// If we got here, that means that the system is most likely hung.
// First collect stack traces from all threads of the system process.
// Then kill this process so that the system will restart.
EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

ArrayList<Integer> pids = new ArrayList<>();
pids.add(Process.myPid());
if (mPhonePid > 0) pids.add(mPhonePid);

final File stack = ActivityManagerService.dumpStackTraces(
        pids, null, null, getInterestingNativePids());

// Give some extra time to make sure the stack traces get written.
// The system's been hanging for a minute, another second or two won't hurt much.
SystemClock.sleep(5000);

// Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
doSysRq('w');
doSysRq('l');

// Try to add the error to the dropbox, but assuming that the ActivityManager
// itself may be deadlocked.  (which has happened, causing this statement to
// deadlock and the watchdog as a whole to be ineffective)
Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
        public void run() {
            // If a watched thread hangs before init() is called, we don't have a
            // valid mActivity. So we can't log the error to dropbox.
            if (mActivity != null) {
                mActivity.addErrorToDropBox(
                        "watchdog", null, "system_server", null, null, null,
                        subject, null, stack, null);
            }
            StatsLog.write(StatsLog.SYSTEM_SERVER_WATCHDOG_OCCURRED, subject);
        }
    };
dropboxThread.start();
try {
    dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
} catch (InterruptedException ignored) {}

IActivityController controller;
synchronized (this) {
    controller = mController;
}
if (controller != null) {
    Slog.i(TAG, "Reporting stuck state to activity controller");
    try {
        Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
        // 1 = keep waiting, -1 = kill system
        int res = controller.systemNotResponding(subject);
        if (res >= 0) {
            Slog.i(TAG, "Activity controller requested to coninue to wait");
            waitedHalf = false;
            continue;
        }
    } catch (RemoteException e) {
    }
}

// Only kill the process if the debugger is not attached.
if (Debug.isDebuggerConnected()) {
    debuggerWasConnected = 2;
}
if (debuggerWasConnected >= 2) {
    Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
} else if (debuggerWasConnected > 0) {
    Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
} else if (!allowRestart) {
    Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
} else {
    Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
    WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
    Slog.w(TAG, "*** GOODBYE!");
    Process.killProcess(Process.myPid());
    System.exit(10);
}

waitedHalf = false;

Addendum

A note on how the four states are defined

static final int COMPLETED = 0;
static final int WAITING = 1;
static final int WAITED_HALF = 2;
static final int OVERDUE = 3;

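COMPLETED is the normal case and the others are abnormal; OVERDUE leads straight to a restart.

As for Monitor, any class will do as an example. I see many people use AMS, so I will use AMS too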

public class ActivityManagerService extends IActivityManager.Stub
        implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback {

AMS implements Watchdog.Monitor, and then in the AMS constructor

Watchdog.getInstance().addMonitor(this);
Watchdog.getInstance().addThread(mHandler);

public void addMonitor(Monitor monitor) {
    synchronized (this) {
        mMonitorChecker.addMonitorLocked(monitor);
    }
}

public void addThread(Handler thread, long timeoutMillis) {
    synchronized (this) {
        final String name = thread.getLooper().getThread().getName();
        mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis));
    }
}

Look at addThread first: besides the threads added in its own constructor, Watchdog also offers a method for outside code to add more. addMonitor then adds the Monitor to mMonitorQueue

void addMonitorLocked(Monitor monitor) {
    // We don't want to update mMonitors when the Handler is in the middle of checking
    // all monitors. We will update mMonitors on the next schedule if it is safe
    mMonitorQueue.add(monitor);
}

Later, scheduleCheckLocked moves the contents of mMonitorQueue into mMonitors, as covered above. Now look at how AMS implements monitor.

public void monitor() {
    synchronized (this) { }
}

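On the surface it does nothing, but it does acquire a lock. If another thread is holding that lock at that moment, the call to monitor goes BLOCKED, and if that lasts long enough Watchdog times out, as covered above.

Analysis

First, having read the source, I find it overall less impressive than the source of some other features. The thread pool I wrote about in my previous post, for example, feels better designed. There are good parts too, though, such as the mMonitorQueue and mMonitors design.

Then, reasoning backwards from the design: why 30 seconds? I cannot work this out. Does 30 seconds mean something, is it just a roughly reasonable number, or was it derived from some principle?

There is also one thing I find puzzling. If anyone knows, please explain.

It is getCompletionStateLocked: when would it return WAITING? The sequence is record mStartTime -> sleep 30 seconds -> getCompletionStateLocked. Normally the current time minus mStartTime must then be at least 30 seconds, so getCompletionStateLocked either returns COMPLETED directly, or returns WAITED_HALF or OVERDUE. When would it be WAITING?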
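The empty synchronized block is easy to try outside Android. In the sketch below, LockProbe, serviceLock and fake-checker are made-up names. One thread holds the lock while a second thread calls a monitor method of the same shape as the AMS one, and the second thread's state is read while it waits.

```java
class LockProbe {
    private final Object serviceLock = new Object(); // stands in for the AMS instance lock

    // Same shape as the monitor() shown above: acquire the lock, do nothing, release it.
    void monitor() {
        synchronized (serviceLock) { }
    }

    // Holds serviceLock on the calling thread while a probe thread calls monitor().
    // Returns the probe's state while the lock is held and after it is released.
    Thread.State[] probeWhileLocked() {
        Thread probe = new Thread(this::monitor, "fake-checker");
        probe.setDaemon(true);
        Thread.State during;
        synchronized (serviceLock) {
            probe.start();
            long deadline = System.nanoTime() + 2_000_000_000L; // poll for at most 2 s
            while (probe.getState() != Thread.State.BLOCKED && System.nanoTime() < deadline) {
                Thread.yield();
            }
            during = probe.getState();
        }
        // The lock is free now, so monitor() can return and the probe can finish.
        try {
            probe.join(2000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new Thread.State[] { during, probe.getState() };
    }

    public static void main(String[] args) {
        Thread.State[] states = new LockProbe().probeWhileLocked();
        System.out.println(states[0] + " -> " + states[1]);
    }
}
```

Nothing here is a deadlock. The lock holder is simply slow to release, and from the probe's point of view the two situations look the same.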
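One possible reading, offered as a guess from the code quoted above rather than a confirmed answer: addThread accepts a timeoutMillis, so a checker can have an mWaitMax larger than the default. The sketch below (hypothetical class WaitingWindow, with the 120-second value chosen only for illustration) just replays the arithmetic of getCompletionStateLocked for an unfinished check after one 30-second sleep.

```java
class WaitingWindow {
    // State of an unfinished check after sleptMillis, for a checker with the given mWaitMax.
    static String stateAfterSleep(long waitMaxMillis, long sleptMillis) {
        if (sleptMillis < waitMaxMillis / 2) return "WAITING";
        if (sleptMillis < waitMaxMillis) return "WAITED_HALF";
        return "OVERDUE";
    }

    public static void main(String[] args) {
        // Default timeout: half the window is exactly the 30 s sleep, so WAITING is skipped.
        System.out.println(stateAfterSleep(60_000, 30_000));   // prints WAITED_HALF
        // Assumed longer timeout: 30 s is still inside the first half of the window.
        System.out.println(stateAfterSleep(120_000, 30_000));  // prints WAITING
    }
}
```

Whether any caller in API 29 actually registers such a timeout is something I have not checked, so treat this as a direction to look in.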

Also, one spot in the source is rather fun and worth sharing. In the run method, in the collect-and-restart path, there is this comment

// Give some extra time to make sure the stack traces get written.
// The system's been hanging for a minute, another second or two won't hurt much.
SystemClock.sleep(5000);

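I did not expect the official developers to be this playful.

Finally, back to the title: what does the dog actually do?

Search online today and you will find many people saying that Watchdog exists to detect deadlocks, which effectively ties Watchdog and deadlocks together. Even where SystemServer starts it, there is an official comment.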

// Start the watchdog as early as possible so we can crash the system server
// if we deadlock during early boot
traceBeginAndSlog("StartWatchdog");
final Watchdog watchdog = Watchdog.getInstance();
watchdog.start();
traceEnd();

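"if we deadlock during early boot" makes it sound as if it exists specifically to handle deadlocks. Of course, if a deadlock occurs, mCurrentMonitor.monitor() blocks, so it is detected. But as I said above, the source shows two ways to get stuck.

(1) The Message that the Handler's MessageQueue was already processing never finishes, so the runnable posted with postAtFrontOfQueue has still not reached run after these 30 seconds
(2) mCurrentMonitor.monitor() inside run stays stuck and is still stuck after 30 seconds; more precisely, the thread is BLOCKED while contending for a lock and never reaches mCompleted = true

In the first case, if the previous message is a long-running operation, run does not execute, and the flow never even reaches the lock check. Admittedly, the threads being watched here are special ones, and long-running work on the main thread and the like is not realistic. In the second case, does mCurrentMonitor.monitor() staying stuck necessarily mean a deadlock? A thread that simply holds the lock and never releases it produces the same result.

So my personal view is that Watchdog is not only for detecting deadlocks. It monitors certain threads to guard against locks being held so long that the threads cannot respond, or long-running work keeping them from responding in time. Look again at the definition of a watchdog: its job is to check the chip's internal state periodically and send the chip a reset signal as soon as something goes wrong. If it were only about detecting deadlocks, I think it could just as well be called DeadlockWatchdog.

Summary

The main flow of Watchdog: run an infinite loop that keeps posting a message to the designated threads and then sleeps for 30 seconds. When the sleep ends, it checks whether the message was handled. If so, the next cycle proceeds normally. If not, it looks at the time since the message was posted: under 30 seconds, do nothing; between 30 and 60 seconds, collect information; over 60 seconds, collect information and restart.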
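The first case involves no lock at all, which a small experiment shows. As before, a single-thread executor stands in for the Looper thread, and SlowMessageDemo and its variable names are made up for illustration. A slow task is already running when the check is queued, so the check cannot set its completed flag until the slow task ends.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class SlowMessageDemo {
    // Returns { completed while the slow task was running, completed after it ended }.
    static boolean[] run() {
        ExecutorService looper = Executors.newSingleThreadExecutor(); // stand-in for a Looper thread
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        AtomicBoolean completed = new AtomicBoolean(false);           // plays the role of mCompleted
        try {
            // A long-running "message" that is already being processed.
            looper.execute(() -> {
                started.countDown();
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            started.await();
            // The check is queued, but the only worker thread is still busy.
            Future<?> check = looper.submit(() -> completed.set(true));
            boolean whileBusy = completed.get();
            release.countDown();                  // let the slow message end
            check.get(2, TimeUnit.SECONDS);       // now the check gets its turn
            return new boolean[] { whileBusy, completed.get() };
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            looper.shutdownNow();
        }
    }

    public static void main(String[] args) {
        boolean[] r = run();
        System.out.println(r[0] + " -> " + r[1]); // prints false -> true
    }
}
```

postAtFrontOfQueue can jump ahead of queued messages, but it cannot interrupt the one that is already executing, which is the situation modelled here.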

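There are more details, of course, such as the use of SystemClock.uptimeMillis() for timing, which I will not cover separately here.

Overall the design idea is quite good: post a message, wait, then check whether the message was handled. It is the same as ANR detection, a process of planting a bomb and then defusing it.

What I still wonder about is the choice of 30 seconds and whether there is a reason behind it, along with the question above of when the under-30-seconds case can occur.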