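I. Background
After version 8.54.0.0 went live, a large number of crashes appeared in the socket clientsdk.so library.
They fall into two categories:
1. Background crashes: the user does not notice them;
2. Foreground crashes: the user notices them, mainly as the app crashing immediately after launch and then working normally after a restart;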
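II. Locating the problem
1. Through user feedback, and by having users install a test build, we confirmed that the crash really is caused by the socket so library.
2. Because the stack does not show where in the so library the crash happens, a direct fix was not possible, so we started reasoning by elimination:
2.1 Version 8.54.0.0 did not upgrade the socket so library, and a decompiled comparison of the two production versions shows that the libc++_shared library was not upgraded either, so we can rule out the socket crash being caused by upgrading another SDK;
2.2 Version 8.54.0.0 upgraded the bugly SDK. Because of optimizations in the bugly SDK, the filter that previously suppressed records of background crashes in the socket so library stopped working. A surge in reports is therefore understandable, but foreground crashes should not occur;
2.3 After talking to several users, we found that they cannot reproduce the crash reliably; it happens only occasionally, but clearly far more often on 8.54.0.0 than before. Having QA install the test build repeatedly did not reproduce it. Something in 8.54.0.0 appears to trigger the socket so library crash;
2.4 Looking for clues in the logs. The log is as follows: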
#00 pc 000000000005205c /apex/com.android.runtime/lib64/bionic/libc.so (abort+164) [arm64-v8a::82e5b2ff86b193c94139353a92c4af29]
#01 pc 00000000000667d0 /apex/com.android.runtime/lib64/bionic/libc.so (__stack_chk_fail+20) [arm64-v8a::82e5b2ff86b193c94139353a92c4af29]
#02 pc 0000000000067d5c /data/app/~~jnhT7CUNDeCebak8i1CqNw==/com.xx.seeyou-oFoQho_36LUWZWiebvJIkg==/lib/arm64/libclientsdk.so [arm64-v8a::de63b5e1d2007ebbad9c3e7b9001d192]
#03 pc 0000000000067e4c /data/app/~~jnhT7CUNDeCebak8i1CqNw==/com.xx.seeyou-oFoQho_36LUWZWiebvJIkg==/lib/arm64/libclientsdk.so [arm64-v8a::de63b5e1d2007ebbad9c3e7b9001d192]
#04 pc 000000000006cb78 /data/app/~~jnhT7CUNDeCebak8i1CqNw==/com.xx.seeyou-oFoQho_36LUWZWiebvJIkg==/lib/arm64/libclientsdk.so [arm64-v8a::de63b5e1d2007ebbad9c3e7b9001d192]
#05 pc 00000000000732d8 /data/app/~~jnhT7CUNDeCebak8i1CqNw==/com.xx.seeyou-oFoQho_36LUWZWiebvJIkg==/lib/arm64/libclientsdk.so [arm64-v8a::de63b5e1d2007ebbad9c3e7b9001d192]
#06 pc 00000000000742b0 /data/app/~~jnhT7CUNDeCebak8i1CqNw==/com.xx.seeyou-oFoQho_36LUWZWiebvJIkg==/lib/arm64/libclientsdk.so [arm64-v8a::de63b5e1d2007ebbad9c3e7b9001d192]
#07 pc 00000000000748bc /data/app/~~jnhT7CUNDeCebak8i1CqNw==/com.xx.seeyou-oFoQho_36LUWZWiebvJIkg==/lib/arm64/libclientsdk.so [arm64-v8a::de63b5e1d2007ebbad9c3e7b9001d192]
#08 pc 00000000000b3ea0 /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+264) [arm64-v8a::82e5b2ff86b193c94139353a92c4af29]
#09 pc 0000000000053880 /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) [arm64-v8a::82e5b2ff86b193c94139353a92c4af29]
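Two key points stand out in the log above:
- 1. bionic/libc.so (__stack_chk_fail+20)
- __stack_chk_fail is called when the stack protector finds its guard value overwritten, so this frame points to a memory write problem, typically a stack buffer overrun;
- 2. bionic/libc.so (__start_thread+64) [arm64-v8a::82e5b2ff86b193c94139353a92c4af29]
- This frame shows that the crash happened on a newly started thread, inside the thread entry function
- While analyzing bugly's trace logs we found another piece of system log: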
04-07 10:51:32.253 8833 8894 E clientsdk: [clientsdk][lvl:1] [KVS] init. version: 0, key count: 0 /data/docker/im-group/im-sdk/jniwks/MeetYou/jni/android/com_rtc_RTCClient.cpp:87
04-07 10:51:32.315 8833 11111 E clientsdk: [clientsdk][lvl:1] workThreadProcess start, thread id is 498356722864 /data/docker/im-group/im-sdk/jniwks/MeetYou/jni/client_sdk/client_core.cpp:158
04-07 10:51:32.433 8833 8833 E SysUtils: [variable fonts] error to areFontsVariable:java.lang.ClassNotFoundException: com.huawei.android.graphics.fonts.SystemFontsEx
--------- beginning of crash
04-07 10:51:32.584 8833 11111 F libc : stack corruption detected (-fstack-protector)
04-07 10:51:32.662 8833 11227 E Oms-SDK.WearableApiManager: Service missing when getting application info
04-07 10:51:32.990 8833 8833 E HwResourcesImpl: handleAddIconBackground resId = 0 return: android.graphics.drawable.ColorDrawable@a2e5c3f
04-07 10:51:33.388 8833 8870 E summer : not found implements method com.xx.seeyou.protocol.GaStubImpladdGaOtherParams" !!!!!!!!!!!!!!
04-07 10:51:33.876 8833 11111 E eup : get abort message after Q
04-07 10:51:33.897 8833 8833 E lgr : getToolHistoryList, babyId: 53695904, commonId: 220646176
04-07 10:51:33.897 8833 10392 E lgr : GetToolHistoryWorker, babyId: 53695904, commonId: 220646176
04-07 10:51:33.898 8833 10392 E lgr : shouldUpdate: false
04-07 10:51:33.914 8833 8833 E lgr : getToolHistoryList, size:
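Our bold guess: the socket library was initialized and created a thread, the thread body started running, and then a memory problem led to unexpected behavior and the process crashed;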
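As background for the __stack_chk_fail frame and the "stack corruption detected (-fstack-protector)" message: with the stack protector enabled, the compiler places a guard value next to a function's local buffers and checks it before the function returns. The sketch below is only a model of that check. The FakeFrame layout, the kCanary constant and the function name are assumptions made for illustration; the real layout and guard value are chosen by the compiler and libc.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a stack frame: a local buffer followed by the
// guard value that the stack protector keeps next to it.
struct FakeFrame {
    char buffer[8];
    std::uint64_t canary;
};

// Made-up guard value; the real one is a random value set up at process start.
const std::uint64_t kCanary = 0x5AFEC0DE12345678ULL;

// Copies len bytes into the frame without checking the buffer size, then
// performs the kind of check the compiler emits before a function returns.
// Returns true if the guard survived. On a device, a false result is the
// point where __stack_chk_fail() would be called, which aborts the process.
bool canarySurvivesUncheckedCopy(const char* data, std::size_t len) {
    assert(data != nullptr);
    FakeFrame frame;
    std::memset(&frame, 0, sizeof(frame));
    frame.canary = kCanary;

    // Clamp to the size of the model frame so that the demo itself stays
    // well-defined; a real overrun has no such limit.
    const std::size_t n = std::min(len, sizeof(frame));
    std::memcpy(&frame, data, n);

    return frame.canary == kCanary;
}
```

A copy that fits into the 8-byte buffer leaves the guard intact, while a 12-byte copy overwrites part of it. This is why the abort appears at function exit, possibly far away from the statement that performed the bad write.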
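2.5 Verifying the hypotheses:
- The most common cause is a C++ null pointer, so we simulated a null pointer dereference. The output below does not match the production log, so the null pointer case is ruled out. Not confirmed.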
Fatal signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x9c in tid 30316 (example.meetyou), pid 30272 (example.meetyou)
2023-04-07 15:16:59.288 30320-30320 DEBUG crash_dump64 A Softversion: PD2020C_A_7.10.11
2023-04-07 15:16:59.288 30320-30320 DEBUG crash_dump64 A Time: 2023-04-07 15:16:59
2023-04-07 15:16:59.288 30320-30320 DEBUG crash_dump64 A *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
2023-04-07 15:16:59.288 30320-30320 DEBUG crash_dump64 A Build fingerprint: 'vivo/PD2020/PD2020:10/QP1A.190711.020/compiler02071532:user/release-keys'
2023-04-07 15:16:59.288 30320-30320 DEBUG crash_dump64 A Revision: '0'
2023-04-07 15:16:59.288 30320-30320 DEBUG crash_dump64 A ABI: 'arm64'
2023-04-07 15:16:59.288 30320-30320 DEBUG crash_dump64 A Timestamp: 2023-04-07 15:16:59+0800
2023-04-07 15:16:59.288 30320-30320 DEBUG crash_dump64 A pid: 30272, tid: 30316, name: example.meetyou >>> com.example.meetyou <<<
2023-04-07 15:16:59.288 30320-30320 DEBUG crash_dump64 A uid: 10220
2023-04-07 15:16:59.288 30320-30320 DEBUG crash_dump64 A signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x9c
2023-04-07 15:16:59.288 30320-30320 DEBUG crash_dump64 A Cause: null pointer dereference
2023-04-07 15:16:59.288 30320-30320 DEBUG crash_dump64 A x0 00000000000000a0 x1 00000070c804e5e0 x2 00000070c8000000 x3 0000000000000009
2023-04-07 15:16:59.288 30320-30320 DEBUG crash_dump64 A x4 000000000000004e x5 ffffffffffff8000 x6 27650615173e0000 x7 0080ffffffffffff
2023-04-07 15:16:59.288 30320-30320 DEBUG crash_dump64 A x8 000000000000009c x9 ab47f973a0173289 x10 00000000000000e1 x11 0000000000000094
2023-04-07 15:16:59.288 30320-30320 DEBUG crash_dump64 A x12 0000000000000008 x13 ffffffffffffffff x14 ffffffffffff0000 x15 ffffffffffffffff
2023-04-07 15:16:59.288 30320-30320 DEBUG crash_dump64 A x16 0000007156b2e768 x17 0000007156b226bc x18 0000007064fba000 x19 0000000000000093
2023-04-07 15:16:59.288 30320-30320 DEBUG crash_dump64 A x20 00000070c804e5e0 x21 00000000000000a0 x22 00000070b8e112b0 x23 0000007068965020
2023-04-07 15:16:59.288 30320-30320 DEBUG crash_dump64 A x24 0000007068964d50 x25 0000007068964d50 x26 0000007068965020 x27 000000715a8ba020
2023-04-07 15:16:59.288 30320-30320 DEBUG crash_dump64 A x28 0000007feeda38b0 x29 0000007068964cf0
2023-04-07 15:16:59.288 30320-30320 DEBUG crash_dump64 A sp 00000070689643c0 lr 0000007065fa5d90 pc 0000007065fa5d94
2023-04-07 15:16:59.296 30320-30320 DEBUG crash_dump64 A
backtrace:
2023-04-07 15:16:59.296 30320-30320 DEBUG crash_dump64 A #00 pc 00000000000e4d94 /data/app/com.example.meetyou-OAuno40ZMifLuMg-8Oq6NQ==/lib/arm64/libclientsdk.so (meet_you::ClientCore::workThreadProcess(meet_you::ClientCore*)+356) (BuildId: 7e94cedda384a753b36878824c12c0b46777c51c)
2023-04-07 15:16:59.296 30320-30320 DEBUG crash_dump64 A #01 pc 00000000000d6ed8 /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+36) (BuildId: bd43e969ad4563b25263b9efb013f03d)
2023-04-07 15:16:59.296 30320-30320 DEBUG crash_dump64 A #02 pc 0000000000075314 /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: bd43e969ad4563b25263b9efb013f03d)
- Simulated an out-of-bounds array write: c->m_noti_sock_pair[3] = NULL; the log does not match the production crash. This test also confirmed that the C++ catch-all handler, try/catch(...), does not catch this kind of crash. Not confirmed.
Fatal signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0xd4c9b768 in tid 9456 (example.meetyou), pid 9390 (example.meetyou)
2023-04-07 15:46:06.833 9552-9552 DEBUG pid-9552 A Softversion: PD2020C_A_7.10.11
2023-04-07 15:46:06.833 9552-9552 DEBUG pid-9552 A Time: 2023-04-07 15:46:06
2023-04-07 15:46:06.833 9552-9552 DEBUG pid-9552 A *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
2023-04-07 15:46:06.833 9552-9552 DEBUG pid-9552 A Build fingerprint: 'vivo/PD2020/PD2020:10/QP1A.190711.020/compiler02071532:user/release-keys'
2023-04-07 15:46:06.833 9552-9552 DEBUG pid-9552 A Revision: '0'
2023-04-07 15:46:06.833 9552-9552 DEBUG pid-9552 A ABI: 'arm64'
2023-04-07 15:46:06.833 9552-9552 DEBUG pid-9552 A Timestamp: 2023-04-07 15:46:06+0800
2023-04-07 15:46:06.833 9552-9552 DEBUG pid-9552 A pid: 9390, tid: 9456, name: example.meetyou >>> com.example.meetyou <<<
2023-04-07 15:46:06.833 9552-9552 DEBUG pid-9552 A uid: 10220
2023-04-07 15:46:06.833 9552-9552 DEBUG pid-9552 A signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0xd4c9b768
2023-04-07 15:46:06.833 9552-9552 DEBUG pid-9552 A x0 0000000000000000 x1 0000000000000000 x2 0000000000000000 x3 0000000000000004
2023-04-07 15:46:06.834 9552-9552 DEBUG pid-9552 A x4 00000000000001e3 x5 0000000000000000 x6 000000715bb4c000 x7 0000000003cc65aa
2023-04-07 15:46:06.834 9552-9552 DEBUG pid-9552 A x8 00000070b8e138d0 x9 0000000000000000 x10 00000000d4c9b768 x11 00000070c3ef2340
2023-04-07 15:46:06.834 9552-9552 DEBUG pid-9552 A x12 00000070c3ef2370 x13 00000070b8e13850 x14 00000070c3a26540 x15 aaaaaaaaaaaaaaab
2023-04-07 15:46:06.834 9552-9552 DEBUG pid-9552 A x16 0000007066ff5a20 x17 0000007156b2298c x18 0000007065f78000 x19 00000070b8e13808
2023-04-07 15:46:06.834 9552-9552 DEBUG pid-9552 A x20 00000070b8e13900 x21 00000070c3ef2340 x22 00000070c3ef3020 x23 000001875aadc540
2023-04-07 15:46:06.834 9552-9552 DEBUG pid-9552 A x24 0000000000002710 x25 00000070b8e138a4 x26 0000000010624dd3 x27 00000000000003e8
2023-04-07 15:46:06.834 9552-9552 DEBUG pid-9552 A x28 000000000000003f x29 00000070c3ef23b0
2023-04-07 15:46:06.834 9552-9552 DEBUG pid-9552 A sp 00000070c3ef2310 lr 0000007066ee9bc4 pc 0000007066ee9c18
2023-04-07 15:46:06.836 9552-9552 DEBUG pid-9552 A
backtrace:
2023-04-07 15:46:06.836 9552-9552 DEBUG pid-9552 A #00 pc 00000000000e6c18 /data/app/com.example.meetyou-_ETwzjbYKfyqCxBK-Q7m9Q==/lib/arm64/libclientsdk.so (meet_you::ClientCore::processOps()+156) (BuildId: c252c9cd10174f2f3fd5cd18fc836db3a70f6f0f)
2023-04-07 15:46:06.836 9552-9552 DEBUG pid-9552 A #01 pc 00000000000e4dfc /data/app/com.example.meetyou-_ETwzjbYKfyqCxBK-Q7m9Q==/lib/arm64/libclientsdk.so (meet_you::ClientCore::workThreadProcess(meet_you::ClientCore*)+460) (BuildId: c252c9cd10174f2f3fd5cd18fc836db3a70f6f0f)
2023-04-07 15:46:06.836 9552-9552 DEBUG pid-9552 A #02 pc 00000000000d6ed8 /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+36) (BuildId: bd43e969ad4563b25263b9efb013f03d)
2023-04-07 15:46:06.836 9552-9552 DEBUG pid-9552 A #03 pc 0000000000075314 /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: bd43e969ad4563b25263b9efb013f03d)
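The observation that catch(...) does not help can be checked on any POSIX system: a catch-all handler only sees C++ exceptions, while an invalid memory access is delivered by the kernel as a signal. The sketch below is not the code used in the investigation. The names signalThatKilledChild, throwCppException and raiseSegv are hypothetical, and raise(SIGSEGV) stands in for the invalid access so that the demo does not rely on undefined behavior.

```cpp
#include <cassert>
#include <csignal>
#include <stdexcept>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Runs body() inside try/catch(...) in a forked child process.
// Returns 0 if the child exited normally, the signal number if a signal
// killed it, or -1 if fork/waitpid failed.
int signalThatKilledChild(void (*body)()) {
    assert(body != nullptr);
    const pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        try {
            body();
        } catch (...) {
            // Reached for C++ exceptions only, never for signals.
        }
        _exit(0);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

// A C++ exception: the catch-all handler swallows it.
void throwCppException() { throw std::runtime_error("boom"); }

// Stand-in for a bad memory access: SIGSEGV with the default disposition.
void raiseSegv() {
    std::signal(SIGSEGV, SIG_DFL);
    std::raise(SIGSEGV);
}
```

Calling signalThatKilledChild(throwCppException) reports a normal exit, while signalThatKilledChild(raiseSegv) reports SIGSEGV even though the call sits inside the same try block.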
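- Tried disabling the stack check: changed the so build by adding LOCAL_CPPFLAGS += -fno-stack-protector to Android.mk, rebuilt a debug so and sent it to the affected users. After testing with several users, the crash still occurred. Not confirmed.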
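- With -fno-stack-protector in place we built a release so. Locally the crash reproduced only once; after installing the debug so over it, the crash could no longer be reproduced, and after switching back to the release so it still could not be reproduced. The log is below (insufficient information):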
---------------------------- PROCESS STARTED (8996) for package com.xx.seeyou ----------------------------
2023-04-02 10:05:04.843 8996-9126 CrashHandl com.xx.seeyou E meiyou : crashType:2, errorType:SIGABRT, errorMessage:, errorStack:#00 pc 0000000000089e84 /apex/com.android.runtime/lib64/bionic/libc.so (abort+164) [arm64-v8a::e18cca17d252ede5b01226139ce195f2]
#01 pc 000000000064d5e8 /apex/com.android.art/lib64/libart.so (_ZN3art7Runtime5AbortEPKc+1708) [arm64-v8a::ecc30d06a84b114b9c9e4ae24d9a7c47]
#02 pc 00000000000159e0 /apex/com.android.art/lib64/libbase.so [arm64-v8a::c2223684f28269a88f71e985c49cbb2f]
#03 pc 0000000000015004 /apex/com.android.art/lib64/libbase.so (_ZN7android4base10LogMessageD1Ev+484) [arm64-v8a::c2223684f28269a88f71e985c49cbb2f]
#04 pc 00000000003d5aac /apex/com.android.art/lib64/libart.so (_ZN3art22IndirectReferenceTable17AbortIfNoCheckJNIERKNSt3__112basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEE+244) [arm64-v8a::ecc30d06a84b114b9c9e4ae24d9a7c47]
#05 pc 00000000003d7844 /apex/com.android.art/lib64/libart.so (_ZN3art22IndirectReferenceTable6RemoveENS_15IRTSegmentStateEPv+3640) [arm64-v8a::ecc30d06a84b114b9c9e4ae24d9a7c47]
#06 pc 00000000004603a4 /apex/com.android.art/lib64/libart.so (_ZN3art9JavaVMExt15DeleteGlobalRefEPNS_6ThreadEP8_jobject+84) [arm64-v8a::ecc30d06a84b114b9c9e4ae24d9a7c47]
#07 pc 00000000000d5070 /data/app/~~w8GGWhLj--AbLzFAzq40EQ==/com.xx.seeyou-vpqDjs8P8k8lZ8w4q1JFLQ==/lib/arm64/libclientsdk.so [arm64-v8a::f65bcd25c4c461c3fb1961b981513b35]
#08 pc 00000000000d50d4 /data/app/~~w8GGWhLj--AbLzFAzq40EQ==/com.xx.seeyou-vpqDjs8P8k8lZ8w4q1JFLQ==/lib/arm64/libclientsdk.so [arm64-v8a::f65bcd25c4c461c3fb1961b981513b35]
#09 pc 00000000000ef390 /apex/com.android.runtime/lib64/bionic/libc.so (__cxa_finalize+288) [arm64-v8a::e18cca17d252ede5b01226139ce195f2]
#10 pc 00000000000e0b10 /apex/com.android.runtime/lib64/bionic/libc.so (exit+24) [arm64-v8a::e18cca17d252ede5b01226139ce195f2]
#11 pc 0000000000023798 /apex/com.android.art/lib64/libjdwp.so (forceExit+24) [arm64-v8a::d2d34c986fb9607178347cb43cc21125]
#12 pc 000000000001f21c /apex/com.android.art/lib64/libjdwp.so [arm64-v8a::d2d34c986fb9607178347cb43cc21125]
#13 pc 0000000000025620 /apex/com.android.art/lib64/libjdwp.so (debugLoop_run+680) [arm64-v8a::d2d34c986fb9607178347cb43cc21125]
#14 pc 0000000000038624 /apex/com.android.art/lib64/libjdwp.so [arm64-v8a::d2d34c986fb9607178347cb43cc21125]
#15 pc 00000000000e1b90 /apex/com.android.art/lib64/libopenjdkjvmti.so [arm64-v8a::d5ba8a16bc15d047debc49ac6462ad5a]
#16 pc 00000000000ebbb0 /apex/com.android.runtime/lib64/bionic/libc.so (_ZL15__pthread_startPv+264) [arm64-v8a::e18cca17d252ede5b01226139ce195f2]
#17 pc 000000000008b6a8 /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) [arm64-v8a::e18cca17d252ede5b01226139ce195f2]
java:
[Failed to get Java stack]
---------------------------- PROCESS ENDED (8996) for package com.xx.seeyou ----------------------------
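3. At this point we were at a dead end with no ideas left. A lot of time had been spent, and QA had put in a lot of effort without reproducing the crash, so in the end the only option was to go back to the logs for clues;
A big puzzle all along was: why would the socket library report crashes when it had not been updated? The checks above had already ruled out a third-party SDK hooking system APIs, so the only remaining possibility was the bugly upgrade: had the way crashes are counted changed?
Going through all of bugly's background crash logs again, we suddenly found that this socket crash had already appeared in much older versions. Because those were all background crashes, we had filtered them out, so they were no longer reported.
This proves that the socket library does not crash only in this version; it has always crashed. But why did it turn into a foreground crash in this version? The counting had to be at fault, so we decided to roll back the bugly version;
This raised a concern: would downgrading the bugly SDK cause other crashes? The bugly team advised against downgrading and could not guarantee it was safe, but to resolve the crash we still needed to try the downgrade and verify it, so: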
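- 1. Rebuilt the socket library with a simulated C++ null pointer and packaged it as a new library;
- 2. On the 8.54.0.0 branch, integrated this library and disconnected the network so that the crash log would not be uploaded, which simulates the 8.54.0.0 crash; then checked whether upgrading to a newer app version that carries the old bugly SDK would crash;
- 3. Then, on the 8.55.0.0 branch, integrated the old bugly SDK and installed the build over 8.54.0.0. To our surprise, it crashed intermittently, exactly like the users' scenario!
- 4. Then, again on the 8.55.0.0 branch, integrated the new bugly SDK and installed the build over 8.54.0.0. It also crashed intermittently, again exactly like the users' scenario. Light at the end of the tunnel: the crash was reproduced.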
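4. Root cause analysis
- 1. First, the user must already have had a socket crash, most likely in the background as before, which was not uploaded successfully and stayed in bugly's cache. This can only happen from 8.54.0.0 on, because in earlier versions bugly did not process socket crashes at all: they were filtered out.
- 2. Why does launch crash intermittently? After repeated simulation and verification we found a bug in bugly itself: after bugly is initialized, it occasionally triggers the onCrashStart method while uploading crash logs. This step is what produced the puzzling foreground socket crashes on the bugly platform.
Our code then recognizes the crash log saved in point 1 and kills the process directly, a step that was meant to suppress reporting of background crashes; this is why users could not reproduce the crash reliably;
- 3. Why is it hard to reproduce? Because reproducing it requires that a socket crash has occurred, that bugly failed to upload it, and that the new bugly SDK is in use;
- 4. Why does it go away after the user uninstalls the app? Because after uninstalling, bugly has no stored crash data, so onCrashStart is not occasionally triggered and there is no crash;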
5. Solution:
- 1. Keep the bugly upgrade and add a hook method that suppresses background crashes;
- 2. Remove the kill process call;
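The root cause and the two fixes can be modeled roughly as a decision made inside the crash callback. The sketch below is a simplified model and not the Bugly SDK API: CrashRecord, Action, oldCrashCallback and fixedCrashCallback are hypothetical names, and the rule that foreground socket crashes are still reported is an assumption.

```cpp
#include <cassert>
#include <string>

// Hypothetical description of a crash handed to the app's crash callback.
struct CrashRecord {
    std::string library;  // library blamed for the crash, e.g. "libclientsdk.so"
    bool inBackground;    // app state when the crash was recorded
};

enum class Action { Report, Drop, KillProcess };

bool isSocketLibrary(const CrashRecord& record) {
    return record.library == "libclientsdk.so";
}

// Old behavior: every socket-library crash seen by the callback kills the
// process so that nothing gets uploaded. When the callback fires for a cached
// record during a normal launch, the foreground app is killed.
Action oldCrashCallback(const CrashRecord& record) {
    return isSocketLibrary(record) ? Action::KillProcess : Action::Report;
}

// Fixed behavior: background socket crashes are dropped, and the process is
// never killed from the callback.
Action fixedCrashCallback(const CrashRecord& record) {
    if (isSocketLibrary(record) && record.inBackground) return Action::Drop;
    return Action::Report;
}

// Usage: a cached background crash replayed at launch.
void example() {
    const CrashRecord cached = {"libclientsdk.so", true};
    assert(oldCrashCallback(cached) == Action::KillProcess);
    assert(fixedCrashCallback(cached) == Action::Drop);
}
```

In this model the filtering goal is unchanged. Only the side effect on the running process is removed, so a replayed record can no longer take down a healthy launch.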
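6. A new idea regarding the socket library's background crashes:
All so libraries we currently build are release builds. We could push a debug so to selected users, which should let us pinpoint the exact C++ source line and analyze why there are so many background crashes;
(However, the story does not end here: after this version shipped, many users still crashed; see follow-up 2)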
===============================================
===============================================
addr2line native analysis method:
[ice@icedeMacBook-Pro ~] find . -name *addr2line
zsh: no matches found: *addr2line
[ice@icedeMacBook-Pro ~] cd android-ndk-r19c
[ice@icedeMacBook-Pro android-ndk-r19c] cd /Users/ice/Downloads/android-ndk-r19c/toolchains/aarch64-linux-android-4.9/prebuilt/darwin-x86_64/bin
[ice@icedeMacBook-Pro bin] ./aarch64-linux-android-addr2line -f -e /Users/ice/MeiyouCode/imsdk/app/build/intermediates/ndkBuild/debug/obj/local/arm64-v8a/libclientsdk.so 00000000000e2f80
_ZN8meet_you10ClientCore17workThreadProcessEPS0_
/Users/ice/MeiyouCode/imsdk/app/src/main/jni/client_sdk/client_core.cpp:163
[ice@icedeMacBook-Pro bin]$
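When many frames need to be symbolized, the pc offset and library name can be pulled out of each backtrace line before they are passed to addr2line. The sketch below is one possible helper, assuming the "#NN pc OFFSET /path/lib.so" frame shape seen in the logs in this document; Frame, parseFrame and addr2lineCommand are hypothetical names.

```cpp
#include <cassert>
#include <sstream>
#include <string>

struct Frame {
    std::string pc;       // offset inside the library, hex digits as logged
    std::string library;  // file name only, e.g. "libclientsdk.so"
};

// Parses a line that contains "#NN pc OFFSET /path/to/lib.so ...".
// Any logcat prefix before the '#' is ignored.
// Returns false if the line does not have that shape.
bool parseFrame(const std::string& line, Frame* out) {
    assert(out != nullptr);
    const std::string::size_type hash = line.find('#');
    if (hash == std::string::npos) return false;

    std::istringstream in(line.substr(hash));
    std::string index, pcTag, pc, path;
    if (!(in >> index >> pcTag >> pc >> path) || pcTag != "pc") return false;

    const std::string::size_type slash = path.rfind('/');
    out->pc = pc;
    out->library = (slash == std::string::npos) ? path : path.substr(slash + 1);
    return true;
}

// Builds the command line for one frame, using the -f and -e options shown
// in the transcript above. tool and unstrippedSo are supplied by the caller.
std::string addr2lineCommand(const std::string& tool,
                             const std::string& unstrippedSo,
                             const Frame& frame) {
    return tool + " -f -e " + unstrippedSo + " " + frame.pc;
}
```

The offsets only resolve to the right source lines when the unstripped so passed to -e comes from the same build as the library that crashed, so the build ID shown in the frame should be compared first.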