从哪里开始阅读代码?
系统ANR流程的代码入口是 AMS#appNotResponding
ANR DUMP
在发生ANR后,系统会dump当前进程及关键进程的线程堆栈、状态
Skip ANR 场景
- During shutdown skipping ANR
- Skipping duplicate ANR
- Crashing app skipping ANR
- App already killed by AM skipping ANR
- Skipping died app ANR
仅dump ANR进程场景
如果dump时间超过1分钟,就只dump anr进程避免使系统更加繁忙
/**
* If the system is extremely slow somehow that the ANR has been pending too long for more than
* this time, the information might be outdated. So we only the dump the unresponsive process
* instead of including other processes to avoid making the system more busy.
*/
private static final long EXPIRED_REPORT_TIME_MS = TimeUnit.MINUTES.toMillis(1);
如果是后台进程anr,也只dump anr进程
/**
* Unless configured otherwise, swallow ANRs in background processes & kill the process.
* Non-private access is for tests only.
*/
@VisibleForTesting
boolean isSilentAnr() {
return !getShowBackground() && !isInterestingForBackgroundTraces();
}
ANR输出内容总结
触发ANR时系统会输出关键信息:(这个较耗时,可能会有10s)
- 将am_anr信息,输出到EventLog.(ANR开始起点看EventLog)
- 获取重要进程trace信息,保存到/data/anr/traces.txt;
- ANR reason以及CPU使用情况信息,输出到main log;
- 再将CPU使用情况和进程trace文件信息,再保存到/data/system/dropbox;
关键信息解读
ANR发生日志关键字
10-25 15:37:55.057 1000 1787 32274 I am_anr : [0,27455,com.unicom.woshipin,953712196,Input dispatching timed out (com.unicom.woshipin/com.unicom.wotvvertical.busiz.main.PortMainActivity, 49ec192 com.unicom.woshipin/com.unicom.wotvvertical.busiz.main.PortMainActivity (server) is not responding. Waited 8012ms for MotionEvent(action=DOWN))] 10-25 15:38:04.818 1000 1787 32274 E ActivityManager: ANR in com.unicom.woshipin (com.unicom.woshipin/com.unicom.wotvvertical.busiz.pugc.videopost.videoselect.VideoSelectActivity) 10-25 15:38:04.823 1000 1787 32274 D ActivityManager: Completed ANR of com.unicom.woshipin in 9834ms, latency 1ms
Trace
- "SignalCatcher"线程:Android的Zygote在Fork进程的时候,都会在InitNonZygoteOrPostFork时调用StartSignalCatcher创建一个新的SignalCatcher线程,这个线程的作用就是用来捕获Linux信号。
- "Runnable"、"Waiting"、"state=xxx":表示当前线程的工作状态
- "utm=xxx"、"stm=xxx":表示该线程从创建到现在,被 CPU 调度的真实运行时长,不包括线程等待或者 Sleep 耗时,其中线程 CPU 耗时又可以进一步分为用户空间耗时(utm)和系统空间耗时(stm)。这里的单位是 jiffies,当 HZ=100 时,1utm 等于 10ms。
- "core=xxx”:最后执行这个线程但cpu核的序号
- "nice=xxx":线程优先级,数值越低优先级越高
- "schedstat=xxx":线程调度态,“schedstat=( 38065842 5059682 54 )”,分别表示线程 CPU 执行时长(单位 ns),等待时长,Switch 次数。
NativePollOnce场景
进入NativePollOnce场景的条件:
当MessageQueue没有需要立刻调度的消息时,会有条件的进入 wait(nativePollOnce) 状态,等到超时或者新消息加入时会唤醒(nativeWake)该线程并执行。
如果ANR日志里主线程是这样一个状态,有可能两个原因:
- 该ANR是CPU抢占或内存紧张等其他因素引起
- 这份ANR日志抓取的时候,主线程已经恢复正常
ANR info
- "Reason":ANR类型:如Input、Receiver、Service等等
- "Load: 0.44 / 0.54 / 0.8":系统负载,分别代表 ANR 发生前 1 分钟,前 5 分钟,前 15 分钟各个时间段系统 CPU 负载,具体数值代表单位时间等待系统调度的任务数(可以理解为线程)
- "CPU usage from XXX to XXX ago":ANR发生前的进程CPU占用
- "CPU usage from XXX to XXX later":ANR发生后的进程CPU占用
- CPU占用 由 user/kernel/iowait几部分组成
关键进程:
- kswapd: 是 linux 中用于页面回收的内核线程,主要用来维护可用内存与文件缓存的平衡,以追求性能最大化,当该线程 CPU 占用过高,说明
系统可用内存紧张
,或者内存碎片化严重,需要进行 file cache 回写或者内存交换(交换到磁盘),线程 CPU 过高则系统整体性能将会明显下降,进而影响所有应用调度。 - mmcqd: 内核线程,主要作用是把上层的 IO 请求进行统一管理和转发到 Driver 层,当该线程 CPU 占用过高,说明系统存在大量文件读写,当然如果内存紧张也会触发文件回写和内存交换到磁盘,所以
kswapd
和mmcqd
经常是同步出现的 - kworker: 表示执行”work”(处理系统调用)的Linux内核进程。你可以在你的进程列表中有几个:
kworker/0:1
是你的第一个CPU核心,kworker/1:1
是第二个等。
Logcat日志
Android 系统在一些频繁调用的接口中,分别在方法前后利用 checktime、checkSLow 检测,以判断本次函数执行耗时是否超过设定阈值。当实际耗时超过设置阈值,则会给出“Slow XXX”提示,表示系统进程调度受到了影响。
一般来说这些系统进程优先级比较高,如果系统进程调度都受到了影响,那么则反映了这段时间内系统性能很有可能发生了问题。
"Slow operation"
系统Slow operation计算思路:
private void checkSlow(long startTime, String where) {
long now = SystemClock.uptimeMillis();
if ((now - startTime) > 50) {
// If we are taking more than 50ms, log about it.
Slog.w(TAG, "Slow operation: " + (now - startTime) + "ms so far, now at " + where);
}
}
ANR监测原理
系统在通过 Binder 通信向应用进程发送上述组件消息或 Input 事件时,在 AMS 或 Input 服务端同时设置一个异步超时监控。
ANR分类
本文简单分析整理其中的几个场景:
InputDispatcher Timeout原理
应用接收到输入事件时,会向系统注册该输入事件对应的输入通道、输入窗口等信息,InputDispatcher会监听这些注册的输入事件。
当输入事件执行完成了,就会向系统发送处理完成finish消息。
如果在5s内,没有收到输入事件结束finish消息,则InputDispatcher发送ANR通知。
input event1:通过InputChannel分发mPendingEvnet到APP层后, mPendingEvnet保存到waitQueue;
input event2:在下一个input event到来时,waitQueue不为空,导致checkWindowReadyForMoreInputLocked返回值不为空,触发handleTargetsNotReadyLocked,然后将当前时间+5s作为mInputTargetWaitTimeoutTime,并设置mInputTargetWaitTimeoutTime为mLooper下一次唤醒的时间
BroadcastReceiver Timout原理
静态注册的广播和有序广播会ANR,动态注册的非有序广播并不会ANR
- 无序广播将一次性发送给所有接收者,并且不会设置超时监控。
final void processNextBroadcastLocked(boolean fromMsg, boolean skipOomAdj, boolean parallelOnce) {
...
// First, deliver any non-serialized broadcasts right away.
while (mParallelBroadcasts.size() > 0) {
r = mParallelBroadcasts.remove(0);
r.dispatchTime = SystemClock.uptimeMillis();
r.dispatchClockTime = System.currentTimeMillis();
...
final int N = r.receivers.size();
if (DEBUG_BROADCAST_LIGHT) Slog.v(TAG_BROADCAST, "Processing parallel broadcast ["
+ mQueueName + "] " + r);
for (int i=0; i<N; i++) {
Object target = r.receivers.get(i);
if (DEBUG_BROADCAST) Slog.v(TAG_BROADCAST,
"Delivering non-ordered on [" + mQueueName + "] to registered "
+ target + ": " + r);
deliverToRegisteredReceiverLocked(r, (BroadcastFilter)target, false, i);
}
addBroadcastToHistoryLocked(r);
if (DEBUG_BROADCAST_LIGHT) Slog.v(TAG_BROADCAST, "Done with parallel broadcast ["
+ mQueueName + "] " + r);
}
...
}
- 有序广播
final void processNextBroadcastLocked(boolean fromMsg, boolean skipOomAdj, boolean parallelOnce) { ... // Get the next receiver... int recIdx = r.nextReceiver++; // Keep track of when this receiver started, and make sure there // is a timeout message pending to kill it if need be. r.receiverTime = SystemClock.uptimeMillis(); if (recIdx == 0) { r.dispatchTime = r.receiverTime; r.dispatchClockTime = System.currentTimeMillis(); ... } if (! mPendingBroadcastTimeoutMessage) { long timeoutTime = r.receiverTime + mConstants.TIMEOUT; if (DEBUG_BROADCAST) Slog.v(TAG_BROADCAST, "Submitting BROADCAST_TIMEOUT_MSG [" + mQueueName + "] for " + r + " at " + timeoutTime); // 设置超时监控 setBroadcastTimeoutLocked(timeoutTime); } final BroadcastOptions brOptions = r.options; final Object nextReceiver = r.receivers.get(recIdx); if (nextReceiver instanceof BroadcastFilter) { // Simple case: this is a registered receiver who gets // a direct call. BroadcastFilter filter = (BroadcastFilter)nextReceiver; if (DEBUG_BROADCAST) Slog.v(TAG_BROADCAST, "Delivering ordered [" + mQueueName + "] to registered " + filter + ": " + r); // 发送给接收者 deliverToRegisteredReceiverLocked(r, filter, r.ordered, recIdx); if (r.receiver == null || !r.ordered) { // The receiver has already finished, so schedule to // process the next one. if (DEBUG_BROADCAST) Slog.v(TAG_BROADCAST, "Quick finishing [" + mQueueName + "]: ordered=" + r.ordered + " receiver=" + r.receiver); r.state = BroadcastRecord.IDLE; scheduleBroadcastsLocked(); ... }
有序广播执行流程
processNextBroadcastLocked
-> deliverToRegisteredReceiverLocked
-> ActivityThread#scheduleReceiver向ActivityThread主线程消息队列插入一条H.RECEIVER消息
-> H.RECEIVER消息执行handleReceiver
-> receiver.onReceive(context.getReceiverRestrictedContext(), data.intent);
-> handleReceiver执行完毕最后会调用data.finish()方法通知AMS
-> AMS#finishReceiver方法中调用processNextBroadcastLocked执行下一条broadcast
-> 在processNextBroadcastLocked分发无序广播后会调用cancelBroadcastTimeoutLocked取消上一个超时监控
广播发送时,会判断该进程是否存在,不存在则创建,创建进程的耗时也算在超时时间里
ContentProvider Timout原理
ContentProvider的超时是在provider进程首次启动的时候才会检测 当provider进程已启动的场景,再次请求provider并不会触发provider超时。
ContentProvider的超时监听是发生在 AMS#attachApplicationLocked过程中
private boolean attachApplicationLocked(@NonNull IApplicationThread thread,
int pid, int callingUid, long startSeq) {
...
if (providers != null && checkAppInLaunchingProvidersLocked(app)) {
Message msg = mHandler.obtainMessage(CONTENT_PROVIDER_PUBLISH_TIMEOUT_MSG);
msg.obj = app;
mHandler.sendMessageDelayed(msg,
ContentResolver.CONTENT_PROVIDER_PUBLISH_TIMEOUT_MILLIS);
}
checkTime(startTime, "attachApplicationLocked: before bindApplication");
...
}
监听消息的移除发生在AMS#publishContentProviders过程,即在完成provider的安装工作后移除TIMEOUT消息
public final void publishContentProviders(IApplicationThread caller,
List<ContentProviderHolder> providers) {
if (providers == null) {
return;
}
enforceNotIsolatedCaller("publishContentProviders");
synchronized (this) {
final ProcessRecord r = getRecordForAppLocked(caller);
if (DEBUG_MU) Slog.v(TAG_MU, "ProcessRecord uid = " + r.uid);
if (r == null) {
throw new SecurityException(
"Unable to find app for caller " + caller
+ " (pid=" + Binder.getCallingPid()
+ ") when publishing content providers");
}
final long origId = Binder.clearCallingIdentity();
final int N = providers.size();
for (int i = 0; i < N; i++) {
ContentProviderHolder src = providers.get(i);
if (src == null || src.info == null || src.provider == null) {
continue;
}
ContentProviderRecord dst = r.pubProviders.get(src.info.name);
if (DEBUG_MU) Slog.v(TAG_MU, "ContentProviderRecord uid = " + dst.uid);
if (dst != null) {
ComponentName comp = new ComponentName(dst.info.packageName, dst.info.name);
mProviderMap.putProviderByClass(comp, dst);
String names[] = dst.info.authority.split(";");
for (int j = 0; j < names.length; j++) {
mProviderMap.putProviderByName(names[j], dst);
}
int launchingCount = mLaunchingProviders.size();
int j;
boolean wasInLaunchingProviders = false;
for (j = 0; j < launchingCount; j++) {
if (mLaunchingProviders.get(j) == dst) {
mLaunchingProviders.remove(j);
wasInLaunchingProviders = true;
j--;
launchingCount--;
}
}
if (wasInLaunchingProviders) {
mHandler.removeMessages(CONTENT_PROVIDER_PUBLISH_TIMEOUT_MSG, r);
}
....
}
}
Binder.restoreCallingIdentity(origId);
}
}
ContentProvider触发超时不会产生ANR,而是直接移除进程
com/android/server/am/ActivityManagerService.java
...
case CONTENT_PROVIDER_PUBLISH_TIMEOUT_MSG: {
ProcessRecord app = (ProcessRecord)msg.obj;
synchronized (ActivityManagerService.this) {
processContentProviderPublishTimedOutLocked(app);
}
private final void processContentProviderPublishTimedOutLocked(ProcessRecord app) {
cleanupAppInLaunchingProvidersLocked(app, true);
mProcessList.removeProcessLocked(app, false, true,
ApplicationExitInfo.REASON_INITIALIZATION_FAILURE,
ApplicationExitInfo.SUBREASON_UNKNOWN,
"timeout publishing content providers");
}
Service Timeout原理
- Service的以下回调都会触发ANR:onCreate(), onStartCommand(), onStart(), onBind(), onRebind(), onTaskRemoved(), onUnbind(), onDestroy().
- 前台Service超时时间为20s,后台Service超时时间为200s
- 如何区分前台、后台执行————当前APP处于用户态,此时执行的Service则为前台执行。
- 用户态:进程有前台activity、有前台广播在执行、有foreground service执行
ANR机制的局限性
通过前面我们理解各类型ANR的原理,我们可以发现系统是通过发送超时消息启动异步监测。
在等待执行超时消息时:
- 如果是进程启动场景,有大量业务和库初始化,在插入超时消息前已经有很多条消息调度
- 有些场景是少量消息某一条或多条消息耗时
- 有些场景是其他线程或系统负载过高,导致整个系统的卡顿
在以上这些场景,如果没有来得及通知AMS移除超时消息,便会触发ANR机制,
然后dump ANR进程某个消息执行过程的堆栈。
所以发生ANR问题时,很多时候Trace堆栈都不是RootCause
ANR info中提示但某个组件导致的ANR也有很大可能不是组件本身的问题。
参考文章: