Android ANR机制再梳理

1,533 阅读7分钟

从哪里开始阅读代码?

系统ANR流程的代码入口是 AMS#appNotResponding

ANR DUMP

在发生ANR后,系统会dump当前进程及关键进程的线程堆栈、状态

Skip ANR 场景

  1. During shutdown skipping ANR
  2. Skipping duplicate ANR
  3. Crashing app skipping ANR
  4. App already killed by AM skipping ANR
  5. Skipping died app ANR

仅dump ANR进程场景

如果dump时间超过1分钟,就只dump anr进程避免使系统更加繁忙

    /**
     * If the system is extremely slow somehow that the ANR has been pending too long for more than
     * this time, the information might be outdated. So we only the dump the unresponsive process
     * instead of including other processes to avoid making the system more busy.
     */
    private static final long EXPIRED_REPORT_TIME_MS = TimeUnit.MINUTES.toMillis(1);

如果是后台进程anr,也只dump anr进程

    /**
     * Unless configured otherwise, swallow ANRs in background processes & kill the process.
     * Non-private access is for tests only.
     */
    @VisibleForTesting
    boolean isSilentAnr() {
        return !getShowBackground() && !isInterestingForBackgroundTraces();
    }

ANR输出内容总结

触发ANR时系统会输出关键信息:(这个较耗时,可能会有10s)

  1. 将am_anr信息,输出到EventLog.(ANR开始起点看EventLog)
  2. 获取重要进程trace信息,保存到/data/anr/traces.txt;
  3. ANR reason以及CPU使用情况信息,输出到main log;
  4. 再将CPU使用情况和进程trace文件信息,再保存到/data/system/dropbox;

关键信息解读

ANR发生日志关键字

10-25 15:37:55.057 1000 1787 32274 I am_anr : [0,27455,com.unicom.woshipin,953712196,Input dispatching timed out (com.unicom.woshipin/com.unicom.wotvvertical.busiz.main.PortMainActivity, 49ec192 com.unicom.woshipin/com.unicom.wotvvertical.busiz.main.PortMainActivity (server) is not responding. Waited 8012ms for MotionEvent(action=DOWN))] 10-25 15:38:04.818 1000 1787 32274 E ActivityManager: ANR in com.unicom.woshipin (com.unicom.woshipin/com.unicom.wotvvertical.busiz.pugc.videopost.videoselect.VideoSelectActivity) 10-25 15:38:04.823 1000 1787 32274 D ActivityManager: Completed ANR of com.unicom.woshipin in 9834ms, latency 1ms

Trace

d1ad8b5cef6943158bdab668d76c39f3.png

  1. "SignalCatcher"线程:Android的Zygote在Fork进程的时候,都会在InitNonZygoteOrPostFork时调用StartSignalCatcher创建一个新的SignalCatcher线程,这个线程的作用就是用来捕获Linux信号。
  2. "Runnable"、"Waiting"、"state=xxx":表示当前线程的工作状态
  3. "utm=xxx"、"stm=xxx":表示该线程从创建到现在,被 CPU 调度的真实运行时长,不包括线程等待或者 Sleep 耗时,其中线程 CPU 耗时又可以进一步分为用户空间耗时(utm)和系统空间耗时(stm)。这里的单位是 jiffies,当 HZ=100 时,1utm 等于 10ms。
  4. "core=xxx”:最后执行这个线程但cpu核的序号
  5. "nice=xxx":线程优先级,数值越低优先级越高
  6. "schedstat=xxx":线程调度态,“schedstat=( 38065842 5059682 54 )”,分别表示线程 CPU 执行时长(单位 ns),等待时长,Switch 次数。

NativePollOnce场景

5da0bfb04a5aef3d5759b441aa67cef9.png 进入NativePollOnce场景的条件:

当MessageQueue没有需要立刻调度的消息时,会有条件的进入 wait(nativePollOnce) 状态,等到超时或者新消息加入时会唤醒(nativeWake)该线程并执行。

如果ANR日志里主线程是这样一个状态,有可能两个原因:

  1. 该ANR是CPU抢占或内存紧张等其他因素引起
  2. 这份ANR日志抓取的时候,主线程已经恢复正常

ANR info

59e7ea335c6b76ea5a80320479a917f1.png

  1. "Reason":ANR类型:如Input、Receiver、Service等等
  2. "Load: 0.44 / 0.54 / 0.8":系统负载,分别代表 ANR 发生前 1 分钟,前 5 分钟,前 15 分钟各个时间段系统 CPU 负载,具体数值代表单位时间等待系统调度的任务数(可以理解为线程)
  3. "CPU usage from XXX to XXX ago":ANR发生前的进程CPU占用
  4. "CPU usage from XXX to XXX later":ANR发生后的进程CPU占用
  5. CPU占用 由 user/kernel/iowait几部分组成

关键进程:

  1. kswapd:  是 linux 中用于页面回收的内核线程,主要用来维护可用内存与文件缓存的平衡,以追求性能最大化,当该线程 CPU 占用过高,说明系统可用内存紧张,或者内存碎片化严重,需要进行 file cache 回写或者内存交换(交换到磁盘),线程 CPU 过高则系统整体性能将会明显下降,进而影响所有应用调度。
  2. mmcqd: 内核线程,主要作用是把上层的 IO 请求进行统一管理和转发到 Driver 层,当该线程 CPU 占用过高,说明系统存在大量文件读写,当然如果内存紧张也会触发文件回写和内存交换到磁盘,所以 kswapdmmcqd 经常是同步出现的
  3. kworker: 表示执行”work”(处理系统调用)的Linux内核进程。你可以在你的进程列表中有几个:kworker/0:1是你的第一个CPU核心,kworker/1:1是第二个等。

Logcat日志

Android 系统在一些频繁调用的接口中,分别在方法前后利用 checktime、checkSLow 检测,以判断本次函数执行耗时是否超过设定阈值。当实际耗时超过设置阈值,则会给出“Slow XXX”提示,表示系统进程调度受到了影响。

一般来说这些系统进程优先级比较高,如果系统进程调度都受到了影响,那么则反映了这段时间内系统性能很有可能发生了问题。

"Slow operation"

5cfbf856a3a1a427b08e5b1363de5329.png 系统Slow operation计算思路:

    private void checkSlow(long startTime, String where) {
        long now = SystemClock.uptimeMillis();
        if ((now - startTime) > 50) {
            // If we are taking more than 50ms, log about it.
            Slog.w(TAG, "Slow operation: " + (now - startTime) + "ms so far, now at " + where);
        }
    }

ANR监测原理

系统在通过 Binder 通信向应用进程发送上述组件消息或 Input 事件时,在 AMS 或 Input 服务端同时设置一个异步超时监控。

ANR分类

794ce04c0917168d23aa5e6b710401e8.png

本文简单分析整理其中的几个场景:

InputDispatcher Timeout原理

应用接收到输入事件时,会向系统注册该输入事件对应的输入通道、输入窗口等信息,InputDispatcher会监听这些注册的输入事件。

当输入事件执行完成了,就会向系统发送处理完成finish消息。

如果在5s内,没有收到输入事件结束finish消息,则InputDispatcher发送ANR通知。

input event1:通过InputChannel分发mPendingEvnet到APP层后, mPendingEvnet保存到waitQueue;

input event2:在下一个input event到来时,waitQueue不为空,导致checkWindowReadyForMoreInputLocked返回值不为空,触发handleTargetsNotReadyLocked,然后将当前时间+5s作为mInputTargetWaitTimeoutTime,并设置mInputTargetWaitTimeoutTime为mLooper下一次唤醒的时间

BroadcastReceiver Timout原理

静态注册的广播和有序广播会ANR,动态注册的非有序广播并不会ANR

  1. 无序广播将一次性发送给所有接收者,并且不会设置超时监控。
final void processNextBroadcastLocked(boolean fromMsg, boolean skipOomAdj, boolean parallelOnce) {
  ...
    // First, deliver any non-serialized broadcasts right away.
    while (mParallelBroadcasts.size() > 0) {
        r = mParallelBroadcasts.remove(0);
        r.dispatchTime = SystemClock.uptimeMillis();
        r.dispatchClockTime = System.currentTimeMillis();
        ...
        final int N = r.receivers.size();
        if (DEBUG_BROADCAST_LIGHT) Slog.v(TAG_BROADCAST, "Processing parallel broadcast ["
                + mQueueName + "] " + r);
        for (int i=0; i<N; i++) {
            Object target = r.receivers.get(i);
            if (DEBUG_BROADCAST)  Slog.v(TAG_BROADCAST,
                    "Delivering non-ordered on [" + mQueueName + "] to registered "
                    + target + ": " + r);
            deliverToRegisteredReceiverLocked(r, (BroadcastFilter)target, false, i);
        }
        addBroadcastToHistoryLocked(r);
        if (DEBUG_BROADCAST_LIGHT) Slog.v(TAG_BROADCAST, "Done with parallel broadcast ["
                + mQueueName + "] " + r);
    }
  ...
}
  1. 有序广播
    final void processNextBroadcastLocked(boolean fromMsg, boolean skipOomAdj, boolean parallelOnce) {
      ...
        // Get the next receiver...
        int recIdx = r.nextReceiver++;
          
        // Keep track of when this receiver started, and make sure there
        // is a timeout message pending to kill it if need be.
        r.receiverTime = SystemClock.uptimeMillis();
        if (recIdx == 0) {
            r.dispatchTime = r.receiverTime;
            r.dispatchClockTime = System.currentTimeMillis();
            ...
        }
        if (! mPendingBroadcastTimeoutMessage) {
            long timeoutTime = r.receiverTime + mConstants.TIMEOUT;
            if (DEBUG_BROADCAST) Slog.v(TAG_BROADCAST,
                    "Submitting BROADCAST_TIMEOUT_MSG ["
                    + mQueueName + "] for " + r + " at " + timeoutTime);
            // 设置超时监控
            setBroadcastTimeoutLocked(timeoutTime);
        }
    
        final BroadcastOptions brOptions = r.options;
        final Object nextReceiver = r.receivers.get(recIdx);
    
        if (nextReceiver instanceof BroadcastFilter) {
            // Simple case: this is a registered receiver who gets
            // a direct call.
            BroadcastFilter filter = (BroadcastFilter)nextReceiver;
            if (DEBUG_BROADCAST)  Slog.v(TAG_BROADCAST,
                    "Delivering ordered ["
                    + mQueueName + "] to registered "
                    + filter + ": " + r);
            // 发送给接收者
            deliverToRegisteredReceiverLocked(r, filter, r.ordered, recIdx);
            if (r.receiver == null || !r.ordered) {
                // The receiver has already finished, so schedule to
                // process the next one.
                if (DEBUG_BROADCAST) Slog.v(TAG_BROADCAST, "Quick finishing ["
                        + mQueueName + "]: ordered="
                        + r.ordered + " receiver=" + r.receiver);
                r.state = BroadcastRecord.IDLE;
                scheduleBroadcastsLocked();
                
     ...
    }
    

有序广播执行流程

processNextBroadcastLocked

-> deliverToRegisteredReceiverLocked

-> ActivityThread#scheduleReceiver向ActivityThread主线程消息队列插入一条H.RECEIVER消息

-> H.RECEIVER消息执行handleReceiver

-> receiver.onReceive(context.getReceiverRestrictedContext(), data.intent);

-> handleReceiver执行完毕最后会调用data.finish()方法通知AMS

-> AMS#finishReceiver方法中调用processNextBroadcastLocked执行下一条broadcast

-> 在processNextBroadcastLocked分发无序广播后会调用cancelBroadcastTimeoutLocked取消上一个超时监控

广播发送时,会判断该进程是否存在,不存在则创建,创建进程的耗时也算在超时时间里

ContentProvider Timout原理

ContentProvider的超时是在provider进程首次启动的时候才会检测 当provider进程已启动的场景,再次请求provider并不会触发provider超时。

ContentProvider的超时监听是发生在 AMS#attachApplicationLocked过程中

  private boolean attachApplicationLocked(@NonNull IApplicationThread thread,
          int pid, int callingUid, long startSeq) {
  ...
          if (providers != null && checkAppInLaunchingProvidersLocked(app)) {
          Message msg = mHandler.obtainMessage(CONTENT_PROVIDER_PUBLISH_TIMEOUT_MSG);
          msg.obj = app;
          mHandler.sendMessageDelayed(msg,
                  ContentResolver.CONTENT_PROVIDER_PUBLISH_TIMEOUT_MILLIS);

      }

      checkTime(startTime, "attachApplicationLocked: before bindApplication");
  ...
  }

监听消息的移除发生在AMS#publishContentProviders过程,即在完成provider的安装工作后移除TIMEOUT消息

    public final void publishContentProviders(IApplicationThread caller,
            List<ContentProviderHolder> providers) {
        if (providers == null) {
            return;
        }

        enforceNotIsolatedCaller("publishContentProviders");
        synchronized (this) {
            final ProcessRecord r = getRecordForAppLocked(caller);
            if (DEBUG_MU) Slog.v(TAG_MU, "ProcessRecord uid = " + r.uid);
            if (r == null) {
                throw new SecurityException(
                        "Unable to find app for caller " + caller
                      + " (pid=" + Binder.getCallingPid()
                      + ") when publishing content providers");
            }

            final long origId = Binder.clearCallingIdentity();

            final int N = providers.size();
            for (int i = 0; i < N; i++) {
                ContentProviderHolder src = providers.get(i);
                if (src == null || src.info == null || src.provider == null) {
                    continue;
                }
                ContentProviderRecord dst = r.pubProviders.get(src.info.name);
                if (DEBUG_MU) Slog.v(TAG_MU, "ContentProviderRecord uid = " + dst.uid);
                if (dst != null) {
                    ComponentName comp = new ComponentName(dst.info.packageName, dst.info.name);
                    mProviderMap.putProviderByClass(comp, dst);
                    String names[] = dst.info.authority.split(";");
                    for (int j = 0; j < names.length; j++) {
                        mProviderMap.putProviderByName(names[j], dst);
                    }

                    int launchingCount = mLaunchingProviders.size();
                    int j;
                    boolean wasInLaunchingProviders = false;
                    for (j = 0; j < launchingCount; j++) {
                        if (mLaunchingProviders.get(j) == dst) {
                            mLaunchingProviders.remove(j);
                            wasInLaunchingProviders = true;
                            j--;
                            launchingCount--;
                        }
                    }
                    if (wasInLaunchingProviders) {
                        mHandler.removeMessages(CONTENT_PROVIDER_PUBLISH_TIMEOUT_MSG, r);
                    }
                    ....
                }
            }

            Binder.restoreCallingIdentity(origId);
        }
    }

ContentProvider触发超时不会产生ANR,而是直接移除进程

com/android/server/am/ActivityManagerService.java
...
case CONTENT_PROVIDER_PUBLISH_TIMEOUT_MSG: {
    ProcessRecord app = (ProcessRecord)msg.obj;
    synchronized (ActivityManagerService.this) {
        processContentProviderPublishTimedOutLocked(app);
    }
                
private final void processContentProviderPublishTimedOutLocked(ProcessRecord app) {
  cleanupAppInLaunchingProvidersLocked(app, true);
  mProcessList.removeProcessLocked(app, false, true,
          ApplicationExitInfo.REASON_INITIALIZATION_FAILURE,
          ApplicationExitInfo.SUBREASON_UNKNOWN,
          "timeout publishing content providers");
}

Service Timeout原理

  1. Service的以下回调都会触发ANR:onCreate(), onStartCommand(), onStart(), onBind(), onRebind(), onTaskRemoved(), onUnbind(), onDestroy().
  2. 前台Service超时时间为20s,后台Service超时时间为200s
  3. 如何区分前台、后台执行————当前APP处于用户态,此时执行的Service则为前台执行。
  4. 用户态:进程有前台activity、有前台广播在执行、有foreground service执行

ANR机制的局限性

通过前面我们理解各类型ANR的原理,我们可以发现系统是通过发送超时消息启动异步监测。

在等待执行超时消息时:

  1. 如果是进程启动场景,有大量业务和库初始化,在插入超时消息前已经有很多条消息调度
  2. 有些场景是少量消息某一条或多条消息耗时
  3. 有些场景是其他线程或系统负载过高,导致整个系统的卡顿

在以上这些场景,如果没有来得及通知AMS移除超时消息,便会触发ANR机制,

然后dump ANR进程某个消息执行过程的堆栈。

所以发生ANR问题时,很多时候Trace堆栈都不是RootCause

ANR info中提示但某个组件导致的ANR也有很大可能不是组件本身的问题。


参考文章:

今日头条 ANR 优化实践系列 - 设计原理及影响因素

ANR视角InputDispatcher