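Preface
Recently, a project running on Android 14 hit a batch of failures during testing: system crashes, OOMs, native exceptions (NE) and more. The root cause turned out to be a colleague's use of the record keyword, a language feature introduced as a preview in Java 14 (and finalized in Java 16). Starting with Android 14, the ART virtual machine recognizes java.lang.Record natively as a built-in kind of class; see the change on the android-14.0.0_r1 branch.
Details: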
android-review.googlesource.com/c/platform/…
The problem
pid: 1842, tid: 1866, name: HeapTaskDaemon >>> system_server <<<
uid: 1000
tagged_addr_ctrl: 0000000000000001 (PR_TAGGED_ADDR_ENABLE)
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x0000001411d50c74
x0 0000000000000001 x1 000000000d9a9db0 x2 00000073765fe430 x3 0000000000000000
x4 0000000000000000 x5 0000000000000000 x6 0000000000000000 x7 b40000739d242c00
x8 00000073765ff000 x9 0000007393c03000 x10 0040000000000000 x11 1042104444444222
x12 0040000000000000 x13 000000007fffffff x14 0000000000000000 x15 0000000572cd3e90
x16 000000739c41aae8 x17 000000742c3492e0 x18 0000007308cc0000 x19 0000001411d50c60
x20 0000001411d50c74 x21 00000073765fe430 x22 b40000736d452000 x23 b40000739d2299d0
x24 00000073765ff000 x25 000000739392c228 x26 0000000000000000 x27 b40000739d23e000
x28 000000739c622000 x29 00000073765fe330
lr 000000739bf3c550 sp 00000073765fe310 pc 000000739bf3c580 pst 0000000080001000
20 total frames
backtrace:
#00 pc 0000000000331580 /apex/com.android.art/lib64/libart.so (void art::mirror::ClassLoader::VisitReferences<true, (art::VerifyObjectFlags)0, (art::ReadBarrierOption)1, art::gc::collector::ConcurrentCopying::ComputeLiveBytesAndMarkRefFieldsVisitor<true> >(art::ObjPtr<art::mirror::Class>, art::gc::collector::ConcurrentCopying::ComputeLiveBytesAndMarkRefFieldsVisitor<true> const&)+112) (BuildId: 2b54ae607a1150ec97780dec66cb5869)
#01 pc 00000000003143a8 /apex/com.android.art/lib64/libart.so (void art::mirror::Object::VisitReferences<true, (art::VerifyObjectFlags)0, (art::ReadBarrierOption)1, art::gc::collector::ConcurrentCopying::ComputeLiveBytesAndMarkRefFieldsVisitor<true>, art::gc::collector::ConcurrentCopying::ComputeLiveBytesAndMarkRefFieldsVisitor<true> >(art::gc::collector::ConcurrentCopying::ComputeLiveBytesAndMarkRefFieldsVisitor<true> const&, art::gc::collector::ConcurrentCopying::ComputeLiveBytesAndMarkRefFieldsVisitor<true> const&)+888) (BuildId: 2b54ae607a1150ec97780dec66cb5869)
#02 pc 0000000000313f5c /apex/com.android.art/lib64/libart.so (art::gc::collector::ConcurrentCopying::AddLiveBytesAndScanRef(art::mirror::Object*)+220) (BuildId: 2b54ae607a1150ec97780dec66cb5869)
#03 pc 0000000000314a50 /apex/com.android.art/lib64/libart.so (art::gc::collector::ConcurrentCopying::ProcessMarkStackForMarkingAndComputeLiveBytes()+912) (BuildId: 2b54ae607a1150ec97780dec66cb5869)
#04 pc 000000000030d974 /apex/com.android.art/lib64/libart.so (art::gc::collector::ConcurrentCopying::MarkingPhase()+2388) (BuildId: 2b54ae607a1150ec97780dec66cb5869)
#05 pc 000000000030c574 /apex/com.android.art/lib64/libart.so (art::gc::collector::ConcurrentCopying::RunPhases()+228) (BuildId: 2b54ae607a1150ec97780dec66cb5869)
#06 pc 0000000000334720 /apex/com.android.art/lib64/libart.so (art::gc::collector::GarbageCollector::Run(art::gc::GcCause, bool)+304) (BuildId: 2b54ae607a1150ec97780dec66cb5869)
#07 pc 000000000036fe30 /apex/com.android.art/lib64/libart.so (art::gc::Heap::CollectGarbageInternal(art::gc::collector::GcType, art::gc::GcCause, bool, unsigned int)+2528) (BuildId: 2b54ae607a1150ec97780dec66cb5869)
#08 pc 0000000000380e10 /apex/com.android.art/lib64/libart.so (art::gc::Heap::ConcurrentGC(art::Thread*, art::gc::GcCause, bool, unsigned int)+160) (BuildId: 2b54ae607a1150ec97780dec66cb5869)
#09 pc 0000000000387f68 /apex/com.android.art/lib64/libart.so (art::gc::Heap::ConcurrentGCTask::Run(art::Thread*)+72) (BuildId: 2b54ae607a1150ec97780dec66cb5869)
#10 pc 00000000003c3420 /apex/com.android.art/lib64/libart.so (art::gc::TaskProcessor::RunAllTasks(art::Thread*)+64) (BuildId: 2b54ae607a1150ec97780dec66cb5869)
#11 pc 0000000000010e54 /system/framework/arm64/boot-core-libart.oat (art_jni_trampoline+116) (BuildId: 13f7228dccd5041162c83893b5e33fc985974887)
#12 pc 00000000000479f8 /system/framework/arm64/boot-core-libart.oat (java.lang.Daemons$HeapTaskDaemon.runInternal+200) (BuildId: 13f7228dccd5041162c83893b5e33fc985974887)
#13 pc 000000000001f04c /system/framework/arm64/boot-core-libart.oat (java.lang.Daemons$Daemon.run+172) (BuildId: 13f7228dccd5041162c83893b5e33fc985974887)
#14 pc 0000000000163e08 /system/framework/arm64/boot.oat (java.lang.Thread.run+72) (BuildId: 345b0d7377ea4aa1745b6c1f507455fc89836962)
#15 pc 00000000002109a4 /apex/com.android.art/lib64/libart.so (art_quick_invoke_stub+612) (BuildId: 2b54ae607a1150ec97780dec66cb5869)
#16 pc 0000000000253b0c /apex/com.android.art/lib64/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+172) (BuildId: 2b54ae607a1150ec97780dec66cb5869)
#17 pc 00000000006a80c8 /apex/com.android.art/lib64/libart.so (art::Thread::CreateCallback(void*)+1416) (BuildId: 2b54ae607a1150ec97780dec66cb5869)
#18 pc 0000000000101d5c /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+204) (BuildId: 84a42637b3a421b801818f5793418fca)
#19 pc 0000000000095bc0 /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: 84a42637b3a421b801818f5793418fca)
Core dump analysis
(gdb) bt
#0 std::__1::__atomic_base<int, false>::load (this=0x1411d50c74, __m=std::__1::memory_order_relaxed)
#1 art::ReaderWriterMutex::SharedLock (this=0x1411d50c60, self=0xb40000736d452000)
#2 art::ReaderMutexLock::ReaderMutexLock (self=0xb40000736d452000, mu=..., this=<optimized out>)
#3 art::ClassTable::VisitRoots<art::gc::collector::ConcurrentCopying::ComputeLiveBytesAndMarkRefFieldsVisitor<true> > (this=0x1411d50c60, visitor=...)
#4 art::mirror::ClassLoader::VisitReferences<true, (art::VerifyObjectFlags)0, (art::ReadBarrierOption)1, art::gc::collector::ConcurrentCopying::ComputeLiveBytesAndMarkRefFieldsVisitor<true> > (
this=<optimized out>, klass=..., visitor=...)
#5 0x000000739bf1f3ac in art::mirror::Object::VisitReferences<true, (art::VerifyObjectFlags)0, (art::ReadBarrierOption)1, art::gc::collector::ConcurrentCopying::ComputeLiveBytesAndMarkRefFieldsVisitor<true>, art::gc::collector::ConcurrentCopying::ComputeLiveBytesAndMarkRefFieldsVisitor<true> > (this=0x11d50c10, visitor=..., ref_visitor=...)
#6 0x000000739bf1ef60 in art::gc::collector::ConcurrentCopying::AddLiveBytesAndScanRef (this=0xb40000739d23e000, ref=0x11d50c10)
#7 0x000000739bf1fa54 in art::gc::collector::ConcurrentCopying::ProcessMarkStackForMarkingAndComputeLiveBytes (this=0xb40000739d23e000)
#8 0x000000739bf18978 in art::gc::collector::ConcurrentCopying::MarkingPhase (this=<optimized out>)
#9 0x000000739bf17578 in art::gc::collector::ConcurrentCopying::RunPhases (this=0xb40000739d23e000)
#10 0x000000739bf3f724 in art::gc::collector::GarbageCollector::Run (this=0xb40000739d23e000, gc_cause=art::gc::kGcCauseBackground, clear_soft_references=<optimized out>)
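Disassembling the code around the frame #0 PC in GDB shows the direct cause of the fault: the instructions at 0x739bf3c574 ~ 0x739bf3c580 load the value of state_ from an art::ReaderWriterMutex object, and that load raises the SEGV. The address 0x1411d50c74 clearly does not fall inside any valid memory mapping.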
(gdb) x /20i $pc-0x20
0x739bf3c560: cbz w8, 0x739bf3c570
0x739bf3c564: mrs x8, tpidr_el0
0x739bf3c568: ldr x22, [x8, #56]
0x739bf3c56c: b 0x739bf3c574
0x739bf3c570: mov x22, xzr
0x739bf3c574: add x20, x19, #0x14
0x739bf3c578: nop
0x739bf3c57c: nop
=> 0x739bf3c580: ldr w2, [x20]
0x739bf3c584: tbnz w2, #31, 0x739bf3c5a0
(gdb) ptype /o 'art::ReaderWriterMutex'
/* offset | size */ type = class art::ReaderWriterMutex : public art::BaseMutex {
private:
/* 20 | 4 */ art::AtomicInteger state_;
/* 24 | 4 */ class art::Atomic<int> [with T = int] : public std::__1::atomic<T> {
/* total size (bytes): 4 */
} exclusive_owner_;
/* 28 | 4 */ art::AtomicInteger num_contenders_;
/* total size (bytes): 32 */
}
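As a quick check of the arithmetic, the faulting address should equal the pointer held in x19 plus the offset of state_ reported by ptype. A minimal sketch using only values copied from the tombstone and the ptype output (the variable names are mine):

```python
# Values copied from the tombstone registers and the ptype /o output.
X19_THIS = 0x1411D50C60          # x19: `this` of the ReaderWriterMutex
STATE_OFFSET = 20                # offset of state_ (the `add x20, x19, #0x14`)
FAULT_ADDR = 0x0000001411D50C74  # fault addr reported with SIGSEGV

computed = X19_THIS + STATE_OFFSET
print(hex(computed))  # 0x1411d50c74
print(computed == FAULT_ADDR)  # True
```

The match confirms that the crash is a plain field load through a bad this pointer, not a corrupted mutex.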
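Following the code, the lock belongs to the art::ClassTable object at 0x1411d50c60. GDB frame #4 shows that the traversal starts from a ClassLoader, and a Java ClassLoader object is known to hold a native pointer to its class table in the classTable field.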
core-parser> class java.lang.ClassLoader -f
[0x6fa79238]
public abstract class java.lang.ClassLoader extends java.lang.Object {
// Object instance fields:
[0x20] private transient long classTable
[0x18] private transient long allocator
[0x10] public final java.util.Map proxyCache
[0x0c] private final java.lang.ClassLoader parent
[0x08] private final java.util.HashMap packages
// extends java.lang.Object
[0x04] private transient int shadow$_monitor_
[0x00] private transient java.lang.Class shadow$_klass_
}
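At this point we can conclude that something is wrong with the memory of the Java ClassLoader object.
VM memory analysis
First we need the address of this ClassLoader; look at GDB frame #5. (The code referenced here is android-14.0.0_r1.)
There, the ClassLoader is obtained by casting the Object directly, so the ClassLoader object's address is 0x11d50c10. The class found at that address was added by my colleague; its name is obfuscated in this article for privacy, which does not affect the analysis.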
core-parser> p 0x11d50c10
Size: 0x18
Object Name: com.android.server.xxx.Xxxxx$XxxxData
[0x14] private final int a = 0x2
[0x10] private final java.lang.String b = com.xxxx.xxxx
[0x0c] private final java.lang.String c = {xxxxxx}
[0x08] private final java.lang.String d = 11111111
extends java.lang.Record
extends java.lang.Object
[0x04] private transient int shadow$_monitor_ = 0x8ab4cdfe
[0x00] private transient java.lang.Class shadow$_klass_ = 0x13bc56a8
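So this object is not a ClassLoader at all, and that is what triggers the chain of errors.
Was the Java heap stomped?
Suppose the Java heap at 0x11d50c10 had been overwritten. A ClassLoader object and an XxxxData object differ in size (ClassLoader objsize > XxxxData objsize), so if XxxxData bytes had been written on top of a real ClassLoader, stepping from 0x11d50c10 to the next object would not land on a valid object header, and the walk would stop because no real klass could be found. We therefore need to check whether 0x11d50c10 + 0x18 holds a valid object.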
core-parser> p 0x11d50c28
Size: 0x18
Padding: 0x7
Object Name: java.lang.StringBuilder
extends java.lang.AbstractStringBuilder
[0x10] byte coder = 0x0
[0x0c] int count = 0x14
[0x08] byte[] value = 0x11d50c60
extends java.lang.Object
[0x04] private transient int shadow$_monitor_ = 0x0
[0x00] private transient java.lang.Class shadow$_klass_ = 0x6fabf748
Stepping further to 0x11d50c28 + 0x18 also lands on a valid object.
core-parser> p 0x11d50c40
Size: 0x20
Padding: 0x4
Array Name: byte[]
[0] 0x28
[1] 0x70
[2] 0x69
[3] 0x64
[4] 0x3d
[5] 0x31
[6] 0x38
[7] 0x34
[8] 0x32
[9] 0x2c
[10] 0x20
[11] 0x75
[12] 0x69
[13] 0x64
[14] 0x3d
[15] 0x0
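The walk over neighbouring objects can be reproduced from the Size and Padding values that core-parser prints. This is a minimal sketch, assuming objects are aligned to 8 bytes (which is consistent with the dumps above); the helper names are mine, not core-parser's:

```python
OBJECT_ALIGNMENT = 8  # assumption: matches the Size/Padding pairs in the dumps

def round_up(size, alignment=OBJECT_ALIGNMENT):
    """Round size up to the next multiple of alignment (a power of two)."""
    return (size + alignment - 1) & ~(alignment - 1)

def walk(start, raw_sizes):
    """Return the start address of each object, given unpadded sizes in heap order."""
    addresses = []
    addr = start
    for raw in raw_sizes:
        addresses.append(addr)
        addr += round_up(raw)
    return addresses

# Unpadded size = Size - Padding from the dumps: XxxxData, StringBuilder, byte[].
raw_sizes = [0x18, 0x18 - 0x7, 0x20 - 0x4]
addrs = walk(0x11D50C10, raw_sizes)
print([hex(a) for a in addrs])  # ['0x11d50c10', '0x11d50c28', '0x11d50c40']
```

Each computed address matches an object that core-parser decodes cleanly, which is what one would expect from an intact heap region.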
So it is very unlikely that the Java heap at 0x11d50c10 was stomped.
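The dumps also account for the bogus ClassTable pointer 0x1411d50c60. Treating the 0x18-byte XxxxData object as a ClassLoader means reading the 8-byte classTable field at offset 0x20, which lies past the end of the object and inside the neighbouring StringBuilder. The sketch below reproduces that read, assuming a little-endian layout and the field offsets shown in the dumps; the constant names are mine:

```python
import struct

OBJ_ADDR = 0x11D50C10       # object mistaken for a ClassLoader
CLASS_TABLE_OFFSET = 0x20   # ClassLoader.classTable (long), from the class dump
SB_ADDR = 0x11D50C28        # the StringBuilder that follows in the heap
SB_VALUE_OFFSET = 0x08      # StringBuilder.value (4-byte reference)
SB_VALUE = 0x11D50C60       # value field from the dump
SB_COUNT = 0x14             # count field from the dump, stored right after value

# The out-of-bounds read starts exactly at StringBuilder.value.
read_addr = OBJ_ADDR + CLASS_TABLE_OFFSET
print(hex(read_addr), read_addr == SB_ADDR + SB_VALUE_OFFSET)  # 0x11d50c30 True

# Two adjacent 32-bit fields reinterpreted as one little-endian 64-bit long.
raw = struct.pack("<II", SB_VALUE, SB_COUNT)
(class_table,) = struct.unpack("<Q", raw)
print(hex(class_table))  # 0x1411d50c60
```

The result equals the this pointer of art::ClassTable in frame #3, so the crash address is fully explained by a type confusion over otherwise healthy heap contents.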
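Why does the VM mistake the object for a ClassLoader?
Back to Object::VisitReferences. The function dispatches on classFlags and explicitly checks kClassFlagClass, kClassFlagObjectArray, kClassFlagReference and kClassFlagDexCache. It does not explicitly check kClassFlagClassLoader, so an object whose flags match none of the checked values ends up being handled as a ClassLoader (as frames #4 and #5 show). That is where the problem comes from.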
core-parser> p 0x13bc56a8
Size: 0x100
Class Name: com.android.server.xxx.Xxxxx$XxxxData
info java.lang.Class
[0x076] private transient short virtualMethodsOffset = 0x6
[0x074] private transient short copiedMethodsOffset = 0xd
[0x070] private transient int status = 0xf0000000
[0x06c] private transient int referenceInstanceOffsets = 0x7
[0x068] private transient int primitiveType = 0x20000
[0x064] private transient int objectSizeAllocFastPath = 0x18
[0x060] private transient int objectSize = 0x18
[0x05c] private transient int numReferenceStaticFields = 0x0
[0x058] private transient int numReferenceInstanceFields = 0x3
[0x054] private transient volatile int dexTypeIndex = 0xb0f
[0x050] private transient int dexClassDefIndex = 0x55f
[0x04c] private transient int clinitThreadId = 0x87a
[0x048] private transient int classSize = 0x100
[0x044] private transient int classFlags = 0x800
[0x040] private transient int accessFlags = 0x10
[0x038] private transient long sFields = 0x0
[0x030] private transient long methods = 0x72b2424048
[0x028] private transient long iFields = 0x72b2424000
[0x024] private transient java.lang.Object vtable = 0x0
[0x020] private transient java.lang.Class superClass = 0x6fa79cc8
[0x01c] private transient java.lang.String name = 0x0
[0x018] private transient java.lang.Object[] ifTable = 0x6fa407b8
[0x014] private transient dalvik.system.ClassExt extData = 0x0
[0x010] private transient java.lang.Object dexCache = 0xe5bb558
[0x00c] private transient java.lang.Class componentType = 0x0
[0x008] private transient java.lang.ClassLoader classLoader = 0xe584768
extends java.lang.Object
[0x004] private transient int shadow$_monitor_ = 0x0
[0x000] private transient java.lang.Class shadow$_klass_ = 0x6fabd9b0
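Next, look at classFlags of the XxxxData class: classFlags = 0x800 (kClassFlagRecord), which tells us the class has java.lang.Record as its base class. According to this change: android-review.googlesource.com/c/platform/… a class that extends java.lang.Record has its classFlags set to kClassFlagRecord in LinkClass. So on this version of the code, whenever VisitReferences runs on an instance of a class declared with the record keyword (and therefore extending java.lang.Record), the error occurs.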
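The failure mode can be modelled in a few lines. This is only a sketch of the dispatch as described in this article, not the ART source: the one flag value taken from the dump is kClassFlagRecord = 0x800, the other values are placeholders chosen for illustration, and visit_kind_checked is a hypothetical defensive variant, not the fix Google shipped.

```python
K_NORMAL = 0x000        # assumption: plain objects carry no special flag
K_OBJECT_ARRAY = 0x008  # placeholder value
K_CLASS = 0x010         # placeholder value
K_CLASS_LOADER = 0x020  # placeholder value
K_DEX_CACHE = 0x040     # placeholder value
K_REFERENCE = 0x780     # placeholder mask covering the reference kinds
K_RECORD = 0x800        # from the dump: classFlags = 0x800

def visit_kind_buggy(class_flags):
    """Dispatch as described: ClassLoader is the unchecked fallthrough."""
    if class_flags == K_NORMAL:
        return "instance fields"
    if class_flags == K_CLASS:
        return "class"
    if class_flags == K_OBJECT_ARRAY:
        return "object array"
    if class_flags & K_REFERENCE:
        return "reference"
    if class_flags == K_DEX_CACHE:
        return "dex cache"
    return "class loader"  # nothing verifies K_CLASS_LOADER here

def visit_kind_checked(class_flags):
    """Hypothetical variant: only real class loaders take that branch."""
    if class_flags == K_CLASS_LOADER:
        return "class loader"
    kind = visit_kind_buggy(class_flags)
    return "instance fields" if kind == "class loader" else kind

print(visit_kind_buggy(K_RECORD))    # class loader
print(visit_kind_checked(K_RECORD))  # instance fields
```

In the model, a new flag such as K_RECORD silently lands in the last branch, which mirrors how the record instance ended up being visited as a ClassLoader.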
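Is the record keyword unsupported?
Google has in fact already fixed this problem; our code base is simply old, roughly equivalent to android-14.0.0_r1. Vendors receive the Android 14 source earlier than app developers in order to do their adaptation work, so our tree is close to r1. The fix landed in android-14.0.0_r29.
Fix: android-review.googlesource.com/c/platform/…
Can app developers use the record keyword?
I switched Android Studio to Java 17 and tried the record keyword.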
core-parser> class org.penguin.record.MainActivity$RecordTest -f
[0x13360290]
final class org.penguin.record.MainActivity$RecordTest extends com.android.tools.r8.RecordTag {
// Object instance fields:
[0x00c] private final int magic
[0x008] private final java.lang.String name
// extends com.android.tools.r8.RecordTag
// extends java.lang.Object
[0x004] private transient int shadow$_monitor_
[0x000] private transient java.lang.Class shadow$_klass_
}
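When the app is built with Android Studio, RecordTest does not extend java.lang.Record directly; it extends com.android.tools.r8.RecordTag instead. This comes from how the build toolchain (R8) rewrites records internally, so the class is not treated by ART as a native record type.
Afterword
This was my first encounter with a system issue caused by the Java 14 record feature. The investigation was not hard, but this kind of failure is fairly rare. The record keyword is usable, but use it with care.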