Android: ANR

ANR trigger flow

Planting the bomb

Context.startService
The call chain is as follows:
AMS.startService
ActiveServices.startService
ActiveServices.realStartServiceLocked

private final void realStartServiceLocked(ServiceRecord r, ProcessRecord app, boolean execInFg) throws RemoteException {
    ...
    // 1. Post the delayed SERVICE_TIMEOUT_MSG here
    bumpServiceExecutingLocked(r, execInFg, "create");
    try {
        ...
        // 2. Ask the app process (through its ApplicationThread binder) to create the service
        app.thread.scheduleCreateService(r, r.serviceInfo,
                mAm.compatibilityInfoForPackageLocked(r.serviceInfo.applicationInfo),
                app.repProcState);
    } 
    ...
}

bumpServiceExecutingLocked at comment 1 calls scheduleServiceTimeoutLocked internally:

    void scheduleServiceTimeoutLocked(ProcessRecord proc) {
        ...
        Message msg = mAm.mHandler.obtainMessage(
                ActivityManagerService.SERVICE_TIMEOUT_MSG);
        msg.obj = proc;
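        // Post the delayed message: SERVICE_TIMEOUT is 20s for foreground execution;
        // SERVICE_BACKGROUND_TIMEOUT is 10x that (200s) in AOSP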
        mAm.mHandler.sendMessageDelayed(msg,
                proc.execServicesFg ? SERVICE_TIMEOUT : SERVICE_BACKGROUND_TIMEOUT);
    }

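Before comment 2 asks the app process to create the service, comment 1 posts a delayed Handler message, which plants the bomb. If nobody defuses it within the timeout (20s for foreground execution, 200s for background execution in AOSP), the bomb goes off, that is, ActiveServices#serviceTimeout is called.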

Defusing the bomb

Starting a Service first goes through AMS, which then tells the app process to run the Service lifecycle; this is when ActivityThread's handleCreateService is called.
ActivityThread#handleCreateService


    private void handleCreateService(CreateServiceData data) {
        try {
           ...
            Application app = packageInfo.makeApplication(false, mInstrumentation);
            service.attach(context, this, data.info.name, data.token, app,
                    ActivityManager.getService());
            // 1. Service.onCreate() is invoked
            service.onCreate();
            mServices.put(data.token, service);
            try {
                // 2. The bomb is defused here
                ActivityManager.getService().serviceDoneExecuting(
                        data.token, SERVICE_DONE_EXECUTING_ANON, 0, 0);
            } catch (RemoteException e) {
                throw e.rethrowFromSystemServer();
            }
        }

    }

At comment 1, the Service's onCreate method is called.
At comment 2, AMS's serviceDoneExecuting is called, which eventually reaches ActiveServices.serviceDoneExecutingLocked:

private void serviceDoneExecutingLocked(ServiceRecord r, boolean inDestroying, boolean finishing) {
    // ...
    // Remove the delayed message
    mAm.mHandler.removeMessages(ActivityManagerService.SERVICE_TIMEOUT_MSG, r.app);
    // ...
}

As you can see, once onCreate returns, the delayed message is removed and the bomb is defused.
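The plant/defuse pattern can be tried outside the framework. Below is a minimal sketch, assuming a single-threaded fake clock instead of a real Looper; `FakeHandler`, `advanceTo`, and `timedOut` are hypothetical names for this illustration and are not AOSP code. Only the method names `sendMessageDelayed` and `removeMessages` mirror the real Handler API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical, single-threaded model of a Handler's delayed-message queue.
// Time only moves when advanceTo() is called.
final class FakeHandler {
    static final int SERVICE_TIMEOUT_MSG = 12; // the value is arbitrary in this sketch

    private static final class Msg {
        final int what;
        final Object obj;
        final long when;

        Msg(int what, Object obj, long when) {
            this.what = what;
            this.obj = obj;
            this.when = when;
        }
    }

    private final List<Msg> queue = new ArrayList<>();
    private final List<Object> timedOut = new ArrayList<>();
    private long now = 0;

    // Plant the bomb: the message becomes due at now + delayMs.
    void sendMessageDelayed(int what, Object obj, long delayMs) {
        queue.add(new Msg(what, obj, now + delayMs));
    }

    // Defuse: drop pending messages with the same what and the same obj instance.
    void removeMessages(int what, Object obj) {
        queue.removeIf(m -> m.what == what && m.obj == obj);
    }

    // Move the clock forward; every message that is due "explodes".
    void advanceTo(long timeMs) {
        now = timeMs;
        Iterator<Msg> it = queue.iterator();
        while (it.hasNext()) {
            Msg m = it.next();
            if (m.when <= now) {
                timedOut.add(m.obj);
                it.remove();
            }
        }
    }

    List<Object> timedOut() {
        return timedOut;
    }
}
```

If two services are planted with the same delay and only one is defused before the clock reaches that delay, only the other one ends up in `timedOut()`, which corresponds to serviceTimeout being called for it.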

Detonating the bomb

If the Service's onCreate runs past the timeout (20s in the foreground case), the bomb goes off, that is, ActiveServices#serviceTimeout is called:

    void serviceTimeout(ProcessRecord proc) {
        ...
        if (anrMessage != null) {
            mAm.mAppErrors.appNotResponding(proc, null, null, false, anrMessage);
        }
        ...
    }

In this version of the framework, every ANR ends up calling AppErrors#appNotResponding.
AppErrors#appNotResponding

    final void appNotResponding(ProcessRecord app, ActivityRecord activity,
            ActivityRecord parent, boolean aboveSystem, final String annotation) {
          ...

        // 1. Write to the event log
        // Log the ANR to the event log.
        EventLog.writeEvent(EventLogTags.AM_ANR, app.userId, app.pid,
                app.processName, app.info.flags, annotation);
        ...
        // 2. Collect the log content (ANR info, CPU usage, ...) in a StringBuilder
        // Log the ANR to the main log.
        StringBuilder info = new StringBuilder();
        info.setLength(0);
        info.append("ANR in ").append(app.processName);
        if (activity != null && activity.shortComponentName != null) {
            info.append(" (").append(activity.shortComponentName).append(")");
        }
        info.append("\n");
        info.append("PID: ").append(app.pid).append("\n");
        if (annotation != null) {
            info.append("Reason: ").append(annotation).append("\n");
        }
        if (parent != null && parent != activity) {
            info.append("Parent: ").append(parent.shortComponentName).append("\n");
        }

        ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(true);

        ...
        // 3. Dump the stacks (Java and native) and save them to a file
        // For background ANRs, don't pass the ProcessCpuTracker to
        // avoid spending 1/2 second collecting stats to rank lastPids.
        File tracesFile = ActivityManagerService.dumpStackTraces(
                true, firstPids,
                (isSilentANR) ? null : processCpuTracker,
                (isSilentANR) ? null : lastPids,
                nativePids);

        String cpuInfo = null;
        ...

        // 4. Print the ANR log
        Slog.e(TAG, info.toString());
        if (tracesFile == null) {
            // 5. No traces file was captured, so send SIGNAL_QUIT
            // There is no trace file, so dump (only) the alleged culprit's threads to the log
            Process.sendSignal(app.pid, Process.SIGNAL_QUIT);
        }

        StatsLog.write(StatsLog.ANR_OCCURRED, ...);
        // 6. Write to DropBox
        mService.addErrorToDropBox("anr", app, app.processName, activity, parent, annotation, cpuInfo, tracesFile, null);

        ...

        synchronized (mService) {
            mService.mBatteryStatsService.noteProcessAnr(app.processName, app.uid);
            // 7. Background ANR: kill the process directly
            if (isSilentANR) {
                app.kill("bg anr", true);
                return;
            }

            // 8. Error report
            // Set the app's notResponding state, and look up the errorReportReceiver
            makeAppNotRespondingLocked(app,
                    activity != null ? activity.shortComponentName : null,
                    annotation != null ? "ANR " + annotation : "ANR",
                    info.toString());

            // 9. Show the ANR dialog; this leads to handleShowAnrUi
            // Bring up the infamous App Not Responding dialog
            Message msg = Message.obtain();
            msg.what = ActivityManagerService.SHOW_NOT_RESPONDING_UI_MSG;
            msg.obj = new AppNotRespondingDialog.Data(app, activity, aboveSystem);

            mService.mUiHandler.sendMessage(msg);
        }
    }

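The main flow is:
1. Write to the event log.
2. Build the entry for the main log.
3. Generate the traces file.
4. Print the ANR to logcat (visible in the console).
5. If no traces file was obtained, send SIGNAL_QUIT; according to the code comment, this makes the suspect process dump its thread stacks to the log.
6. Write to DropBox.
7. For a background ANR, kill the process directly.
8. Report the error.
9. Show the ANR dialog, which calls AppErrors#handleShowAnrUi.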
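To make step 4 concrete, here is a small sketch that reproduces only the header-building part of the excerpt as a plain function, so the shape of the text printed to logcat is easy to see. `AnrHeader` and its parameters are hypothetical stand-ins for the `ProcessRecord` and `ActivityRecord` fields used in the excerpt, and the sample reason string in the usage note is purely illustrative.

```java
// Hypothetical helper that mirrors the StringBuilder steps in the excerpt.
// It only formats text; it does not touch any framework state.
final class AnrHeader {
    static String build(String processName, String activity, int pid,
                        String reason, String parent) {
        StringBuilder info = new StringBuilder();
        info.append("ANR in ").append(processName);
        if (activity != null) {
            info.append(" (").append(activity).append(")");
        }
        info.append("\n");
        info.append("PID: ").append(pid).append("\n");
        if (reason != null) {
            info.append("Reason: ").append(reason).append("\n");
        }
        // The excerpt compares record identity; strings are compared by value here.
        if (parent != null && !parent.equals(activity)) {
            info.append("Parent: ").append(parent).append("\n");
        }
        return info.toString();
    }
}
```

Calling `AnrHeader.build("com.example.app", null, 1234, "sample reason", null)` yields three lines: `ANR in com.example.app`, `PID: 1234`, and `Reason: sample reason`. Searching logcat for "ANR in" is therefore a quick way to locate these entries.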

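Summary of the ANR trigger flow

The ANR trigger flow can be likened to planting and defusing a bomb.
Take starting a Service as an example: before the Service's onCreate is called, a delayed message is posted through a Handler (20s for foreground execution); once onCreate finishes, that delayed message is removed.
If onCreate takes longer than the delay, the message is handled as usual, which triggers the ANR: CPU usage, stacks, and other information are collected and the ANR dialog is shown.

ANRs for service, broadcast, and provider all follow this time-bomb-and-defuse principle.

The timeout detection for input is slightly different, however: only when the next input event arrives does the system check whether the previous one has timed out. The bomb planted for an input event is an ordinary bomb with no timer, so it has to be found by minesweeping.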
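The difference from the timer-based case can be shown with a rough model. This is only a sketch of the idea described above: the real check lives in the native InputDispatcher and is considerably more involved. `InputSweeper` and its methods are hypothetical, and the 5-second value is assumed here as the input dispatch timeout.

```java
// Hypothetical model of the "minesweeping" check: nothing fires on its own,
// a timeout is only noticed when another event shows up.
final class InputSweeper {
    static final long INPUT_TIMEOUT_MS = 5_000; // assumed input dispatch timeout

    private long pendingSinceMs = -1; // -1 means the previous event was finished

    // Called when a new event arrives. Returns true if the previous event is
    // still unfinished and has been waiting longer than the timeout.
    boolean onNewEvent(long nowMs) {
        if (pendingSinceMs >= 0) {
            // The new event has to wait behind the unfinished one.
            return nowMs - pendingSinceMs > INPUT_TIMEOUT_MS;
        }
        pendingSinceMs = nowMs; // dispatch this event and start waiting for it
        return false;
    }

    // Called when the app reports that it finished handling the event.
    void onEventFinished() {
        pendingSinceMs = -1;
    }
}
```

In this model, an app that blocks on one event for a long time is never flagged unless a second event arrives, which matches the point that the input bomb has no timer of its own.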

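Scenarios in which ANR occurs

Causes of ANR

System resources are insufficient: other processes or threads are competing heavily for resources such as I/O, memory, or CPU.
Threads contend for resources with each other, for example in a deadlock.
The main thread is busy, so user input does not get a timely response.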

Traditional monitoring (offline)

Using Printer

public interface Printer {
    /**
     * Write a line of text to the output.  There is no need to terminate
     * the given string with a newline.
     */
    void println(String x);
}
```java
public static void loop() {
    for (;;) {
        // 1. Take the next message
        Message msg = queue.next(); // might block
        ...
        // 2. Callback before the message is handled
        if (logging != null) {
            logging.println(">>>>> Dispatching to " + msg.target + " " +
                    msg.callback + ": " + msg.what);
        }
        ...

        // 3. Dispatch and handle the message
        msg.target.dispatchMessage(msg);
        ...

        // 4. Callback after the message has been handled
        if (logging != null) {
            logging.println("<<<<< Finished to " + msg.target + " " + msg.callback);
        }
    }
    ...
}
```

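The logging.println calls at comments 2 and 4 are a hook the framework gives us for measuring how long the Handler takes to process each message: calling Looper.getMainLooper().setMessageLogging(printer) is enough to get a callback just before and just after each message. Note that by the time a stall is detected, dispatchMessage has already returned and been popped off the stack, so capturing the main thread's stack at that point will not include the slow code.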
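A minimal sketch of such a printer follows. To keep it compilable off-device it declares a local `Printer` interface that mirrors `android.util.Printer` and takes the clock as a parameter; `SlowMessagePrinter`, its threshold, and `slowMessages()` are hypothetical names for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

// Local mirror of android.util.Printer so the sketch compiles off-device.
interface Printer {
    void println(String x);
}

// Hypothetical monitor: pairs ">>>>> Dispatching" with "<<<<< Finished"
// and remembers messages that took longer than the threshold.
final class SlowMessagePrinter implements Printer {
    private final long thresholdMs;
    private final LongSupplier clockMs;
    private final List<String> slowMessages = new ArrayList<>();
    private long startMs;
    private String current;

    SlowMessagePrinter(long thresholdMs, LongSupplier clockMs) {
        this.thresholdMs = thresholdMs;
        this.clockMs = clockMs;
    }

    @Override
    public void println(String x) {
        if (x.startsWith(">>>>> Dispatching")) {
            startMs = clockMs.getAsLong();
            current = x;
        } else if (x.startsWith("<<<<< Finished") && current != null) {
            long costMs = clockMs.getAsLong() - startMs;
            if (costMs > thresholdMs) {
                slowMessages.add(costMs + "ms: " + current);
            }
            current = null;
        }
    }

    List<String> slowMessages() {
        return slowMessages;
    }
}
```

On a device, the class would implement `android.util.Printer` directly, be constructed with something like `SystemClock::uptimeMillis` as the clock, and be installed with `Looper.getMainLooper().setMessageLogging(printer)`.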

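Therefore a background thread is needed to sample the main thread's stack periodically, storing the timestamp as the key and the stack as the value in a Map; when a stall is detected, the stacks captured during the stalled interval are looked up.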
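One possible shape for that timestamp-to-stack map is sketched below, assuming a bounded, sorted map so that old samples are dropped and a time range can be queried. `StackSampler`, `sampleOnce`, and `between` are hypothetical names; the periodic scheduling itself is left out, and a background thread would simply call `sampleOnce` on a fixed interval.

```java
import java.util.Collection;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical sampler: timestamp -> stack of the watched thread.
final class StackSampler {
    private final Thread target;
    private final int maxSamples;
    private final ConcurrentSkipListMap<Long, StackTraceElement[]> samples =
            new ConcurrentSkipListMap<>();

    StackSampler(Thread target, int maxSamples) {
        this.target = target;
        this.maxSamples = maxSamples;
    }

    // Take one sample of the target thread's current stack.
    void sampleOnce(long nowMs) {
        samples.put(nowMs, target.getStackTrace());
        while (samples.size() > maxSamples) {
            samples.pollFirstEntry(); // drop the oldest sample
        }
    }

    // Stacks captured while the slow message was running (inclusive bounds).
    Collection<StackTraceElement[]> between(long startMs, long endMs) {
        return samples.subMap(startMs, true, endMs, true).values();
    }
}
```

When the printer reports a slow message that ran from `startMs` to `endMs`, `between(startMs, endMs)` returns the stacks that were captured while it was still executing.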

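However, this approach is only suitable for offline use, for these reasons:

logging.println("<<<<< Finished to " + msg.target + " " + msg.callback); concatenates strings; because it is called so often, it creates a large number of objects and causes memory churn.
Frequently capturing the main thread's stack from a background thread has a performance cost, because taking the stack suspends the main thread.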

ASM bytecode instrumentation monitoring (online)

The jank monitoring in WeChat's Matrix currently uses bytecode instrumentation.

// Before instrumentation
fun method() {
    run()
}
// After instrumentation
fun method() {
    input(1)
    run()
    output(1)
}

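Points to watch when instrumenting:

Avoid a method-count explosion: insert the same function at the entry and exit of every method, and assign each method a unique ID at compile time to pass as the argument.
Filter out trivial functions: skip functions that only return directly, do i++, and the like, and support a configurable blacklist. Functions that are called extremely often should be added to the blacklist to reduce the overall performance cost.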
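The runtime side that the inserted calls would talk to might look like the sketch below. It is a single-threaded illustration under stated assumptions, not Matrix's actual implementation: `MethodTracer`, the threshold, and `slowCalls()` are hypothetical, and only the `input(id)` / `output(id)` names are taken from the pseudocode above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

// Hypothetical receiver for the input(id) / output(id) calls inserted at build time.
final class MethodTracer {
    private final long thresholdMs;
    private final LongSupplier clockMs;
    // Each entry is {methodId, startMs} for a method that has not returned yet.
    private final ArrayDeque<long[]> stack = new ArrayDeque<>();
    private final List<String> slowCalls = new ArrayList<>();

    MethodTracer(long thresholdMs, LongSupplier clockMs) {
        this.thresholdMs = thresholdMs;
        this.clockMs = clockMs;
    }

    // Inserted at method entry.
    void input(int methodId) {
        stack.push(new long[] {methodId, clockMs.getAsLong()});
    }

    // Inserted at method exit.
    void output(int methodId) {
        long[] top = stack.peek();
        if (top == null || top[0] != methodId) {
            return; // unbalanced call, ignored in this sketch
        }
        stack.pop();
        long costMs = clockMs.getAsLong() - top[1];
        if (costMs > thresholdMs) {
            slowCalls.add("method#" + methodId + " took " + costMs + "ms");
        }
    }

    List<String> slowCalls() {
        return slowCalls;
    }
}
```

Because every method reports through the same two functions and is identified only by an integer, the instrumentation adds two call sites per method rather than new methods, which is the point of the first item above.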

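WeChat's Matrix has been heavily optimized: package size grows by 1%-2% overall and the frame rate drops by no more than 2 frames, so the overall performance impact is acceptable. Even so, it is only enabled in gray-release (staged rollout) builds.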

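ANR collection flow

After an ANR occurs, the system collects data from a number of processes and dumps their stacks to produce the ANR trace file. The first process collected is always the one in which the ANR occurred.
The system sends SIGQUIT to these app processes, and each one starts dumping its stacks when it receives the signal.
Once an app process has dumped its stacks, it writes the trace file by communicating with the system process over a socket.
After the trace file has been written, the system shows the dialog if the ANR process is in the foreground; otherwise it kills the process directly.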

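The Signal Catcher dump happens inside the app process, and the trace is written through a socket writer. If we hook that write method, we can obtain the ANR trace content that the system records. This content is very complete: it includes the state, locks, and stacks (including native stacks) of every thread, and it is very helpful for troubleshooting, especially for native problems and deadlocks.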

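In many cases an ANR is caused by earlier messages that each took a long time and kept accumulating, but when the ANR occurs we do not know which messages were dispatched before it. If we monitor and record the cost of every dispatched message, then fetch those records when an ANR occurs and also compute how long the currently executing message has been running, would it not be clear what the main thread was doing before the ANR? Following this line of thought gives the diagram below:

Key message aggregation: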
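The source does not spell out how the aggregation works, so the following is only one plausible sketch under stated assumptions: a message that costs at least a threshold gets its own record, while runs of short messages are folded into one aggregate record once their accumulated cost reaches the same threshold. `MessageHistory`, `Record`, and the threshold value used in the usage note are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical record keeper for main-thread message costs.
final class MessageHistory {
    static final class Record {
        final int count;     // how many messages this record covers
        final long totalMs;  // their accumulated cost

        Record(int count, long totalMs) {
            this.count = count;
            this.totalMs = totalMs;
        }
    }

    private final long thresholdMs;
    private final List<Record> records = new ArrayList<>();
    private int pendingCount;
    private long pendingMs;

    MessageHistory(long thresholdMs) {
        this.thresholdMs = thresholdMs;
    }

    // Called once per message with its measured cost.
    void onMessageFinished(long costMs) {
        if (costMs >= thresholdMs) {
            flush();                            // close the current run of short messages
            records.add(new Record(1, costMs)); // a slow message gets its own record
            return;
        }
        pendingCount++;
        pendingMs += costMs;
        if (pendingMs >= thresholdMs) {
            flush();
        }
    }

    // Emit the current run of short messages, if any (also called when an ANR is reported).
    void flush() {
        if (pendingCount > 0) {
            records.add(new Record(pendingCount, pendingMs));
            pendingCount = 0;
            pendingMs = 0;
        }
    }

    List<Record> records() {
        return records;
    }
}
```

With a threshold of 300 ms, the costs 100, 100, 150, 500, 20 followed by a `flush()` produce three records: one covering three messages (350 ms in total), one for the single 500 ms message, and one for the trailing 20 ms message. Keeping records at this granularity bounds memory use while still showing where the time went.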

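Linux nice value

The nice value is the number that expresses static priority in UNIX-like operating systems. Every process has its own static priority, and processes with higher priority get to run first.
Nice values range from -20 to +19. The larger a process's nice value, the lower its actual priority (a process with nice +19 has the lowest priority and one with -20 the highest); the default nice value is 0. Because the nice value is a static priority, once set it is not changed by the kernel until it is set again. The nice value only influences how CPU time is allocated; the actual details are decided by the dynamic priority.
The name "nice value" comes from the English word nice, meaning friendly. The higher the nice value, the "nicer" the process is, and the more time it yields to other processes.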
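To get a feel for how much CPU time a nice step shifts, here is a rough calculation, assuming the weighting rule used by the Linux CFS scheduler: nice 0 corresponds to a weight of 1024, and each nice step changes the weight by a factor of roughly 1.25. The kernel uses a precomputed table, so the numbers below are approximations; `NiceShare` is a hypothetical helper.

```java
// Approximate CFS-style weights derived from nice values.
final class NiceShare {
    // Weight for a nice value in [-20, 19]; nice 0 maps to 1024.
    static double weight(int nice) {
        if (nice < -20 || nice > 19) {
            throw new IllegalArgumentException("nice out of range: " + nice);
        }
        return 1024.0 / Math.pow(1.25, nice);
    }

    // Approximate CPU share of process A when A and B compete for one core.
    static double share(int niceA, int niceB) {
        double wa = weight(niceA);
        double wb = weight(niceB);
        return wa / (wa + wb);
    }
}
```

Under this rule two nice-0 processes split a core evenly, while a nice-0 process competing with a nice-1 process gets about 55.6% of the core, which is why one step is often described as shifting roughly 10% of CPU time.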
