Multithreading Hazards in CFRunLoop


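If you are not yet familiar with run loops, see the detailed write-up 深入理解RunLoop ("Understanding RunLoop in Depth").

Apple's official documentation states that CFRunLoop is thread-safe: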

Thread safety varies depending on which API you are using to manipulate your run loop. The functions in Core Foundation are generally thread-safe and can be called from any thread. If you are performing operations that alter the configuration of the run loop, however, it is still good practice to do so from the thread that owns the run loop whenever possible.

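Note, however, that Apple hedges with the vague word "generally".

In practice, some of the operations performed while stopping a run loop are not safe to run from multiple threads.

Unsafe CFRunLoopSource

CFRunLoop itself is thread-safe, but once a CFRunLoopSource is attached that is no longer guaranteed. CFSocket is one example.

Sample code

Consider this custom thread class: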

@interface MyThread()
@property (nonatomic, strong) NSThread *currentThread;
@property (nonatomic, assign) CFRunLoopSourceRef socketSource;
@property (nonatomic, assign) CFSocketRef socket;
@property (nonatomic, assign) CFRunLoopRef currentRunloop;

@end

@implementation MyThread

// Initialize the thread
- (instancetype)init {
    if (self = [super init]) {
        _currentThread = [[NSThread alloc] initWithTarget:self selector:@selector(runThread) object:nil];
    }
    return self;
}

// Start the thread; this method is never called from multiple threads concurrently
- (void)startThread {
    [self.currentThread start];
}

// Thread entry point
- (void)runThread {
    @autoreleasepool {
        // Keep a reference to the run loop so that other threads can stop this thread
        self.currentRunloop = CFRunLoopGetCurrent();
        [self addSocketSource];
        
        CFRunLoopRun();
    }
    NSLog(@"Thread exited");
}

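// This method is never called from multiple threads concurrently
- (void)stopThread {
    [self removeSocketSource];
    // CFRunLoopRef is not an Objective-C object, so lock on self instead
    @synchronized (self) {
        if (_currentRunloop) {
            CFRunLoopStop(_currentRunloop);
            self.currentRunloop = NULL;
        }
    }
}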

// This method is never called from multiple threads concurrently
- (void)addSocketSource {
    int sock;
    sock = socket(AF_INET6, SOCK_STREAM, 0);
    CFSocketContext context = {0, (__bridge void *)(self), NULL, NULL, NULL};
    self.socket = CFSocketCreateWithNative(NULL, sock, kCFSocketReadCallBack, socketCallBack, &context);
    self.socketSource = CFSocketCreateRunLoopSource(NULL, self.socket, 0);
    CFRunLoopAddSource(_currentRunloop, _socketSource, kCFRunLoopDefaultMode);
}

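- (void)removeSocketSource {
    // CFSocketRef is not an Objective-C object, so lock on self instead
    @synchronized (self) {
        if (_socket) {
            // CFSocketInvalidate may be dispatched to another thread, so CFSocketInvalidate
            // and CFRunLoopStop can end up running concurrently on different threads
            CFSocketInvalidate(_socket);
            CFRelease(_socket);
            self.socket = NULL;
        }
    }
}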

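In the real project the CFSocket is managed by a separate socket class, so addSocketSource and removeSocketSource live in that other class. That makes it possible for CFSocketInvalidate and CFRunLoopStop to be called concurrently from different threads.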

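Analyzing a real crash

At first glance nothing looks wrong: every place that needs a lock has one, and the CF-prefixed functions are all documented as thread-safe. Yet if CFSocketInvalidate and CFRunLoopStop run concurrently on different threads, a crash is possible. Here is one crash report from our project: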

Thread 0 name:  Dispatch queue: com.apple.main-thread
Thread 0 Crashed:
0   CoreFoundation                  0x000000018e6a9144 CFRunLoopWakeUp + 92
1   CoreFoundation                  0x000000018e6a9140 CFRunLoopWakeUp + 88
2   CoreFoundation                  0x000000018e6d71e8 CFSocketInvalidate + 712
3   MyApp                           0x00000001000fe424 (-[MySocket stop] + 136)
4   MyApp                           0x00000001000fcd50 (-[MySocket dealloc] + 56)
5   libsystem_blocks.dylib          0x000000018d6afa28 _Block_release + 144
6   libdispatch.dylib               0x000000018d65a1bc _dispatch_client_callout + 16
7   libdispatch.dylib               0x000000018d65ed68 _dispatch_main_queue_callback_4CF + 1000
8   CoreFoundation                  0x000000018e77e810 __CFRUNLOOP_IS_SERVICING_THE_MAIN_DISPATCH_QUEUE__ + 12
9   CoreFoundation                  0x000000018e77c3fc __CFRunLoopRun + 1660
10  CoreFoundation                  0x000000018e6aa2b8 CFRunLoopRunSpecific + 444
11  GraphicsServices                0x000000019015e198 GSEventRunModal + 180
12  UIKit                           0x00000001946f17fc -[UIApplication _run] + 684
13  UIKit                           0x00000001946ec534 UIApplicationMain + 208
14  DuoYiIM                         0x000000010003ca58 0x100024000 + 100952 (main + 132)
15  libdyld.dylib                   0x000000018d68d5b8 start + 4

Thread 0 crashed with ARM-64 Thread State:
  cpsr: 0x0000000020000000     fp: 0x000000016fddab30     lr: 0x000000018e6a9140     pc: 0x000000018e6a9144 
    sp: 0x000000016fddaa00     x0: 0x0000000000000000     x1: 0x0000000000000000    x10: 0x0000000000000000 
   x11: 0x0000000000000000    x12: 0x0000000000000000    x13: 0x0000000000000000    x14: 0x0000000000000000 
   x15: 0x0000000000001203    x16: 0x000000000000012d    x17: 0x000000018f1eef74    x18: 0x0000000000000000 
   x19: 0x000000017056cb50     x2: 0x0000000000001000    x20: 0x000000017056cb40    x21: 0x96e73914144e0055 
   x22: 0x0000000174452990    x23: 0x000000017048bae0    x24: 0x0000000000000000    x25: 0x00000000ffffffff 
   x26: 0xffffffffffffffff    x27: 0x000000017426f1c0    x28: 0x0000000002ffffff    x29: 0x000000016fddab30 
    x3: 0x000000000017e4a6     x4: 0x0000000000012068     x5: 0x0000000000000000     x6: 0x0000000000000036 
    x7: 0xffffffffffffffec     x8: 0x8c8c8c8c8c8c8c8c     x9: 0x000000000000000c

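CFSocketInvalidate was called on the main thread. According to the stack, the crash happened when CFSocketInvalidate called CFRunLoopWakeUp internally.

The stack alone does not reveal the cause, so we need to find out where inside CFRunLoopWakeUp it failed. Here is the disassembly of the matching CoreFoundation version: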

_CFRunLoopWakeUp:
0x0000000181521b9c FF0305D1               sub        sp, sp, #0x140             ; CODE XREF=_CFRunLoopAddTimer+696, _CFRunLoopTimerSetNextFireDate+592, _CFSocketInvalidate+708, __wakeUpRunLoop+276, __CFXRegistrationPost+344, -[CFPrefsSearchListSource asynchronouslyNotifyOfChangesFromDictionary:toDictionary:]+172, ___CFSocketPerformV0+1408, ___CFSocketManager+2004, ___CFSocketManager+4248, _boundPairRead+604, _boundPairReadClose+124, …
0x0000000181521ba0 FC6F11A9               stp        x28, x27, [sp, #0x110]
0x0000000181521ba4 F44F12A9               stp        x20, x19, [sp, #0x120]
0x0000000181521ba8 FD7B13A9               stp        x29, x30, [sp, #0x130]
0x0000000181521bac FDC30491               add        x29, sp, #0x130
0x0000000181521bb0 F40300AA               mov        x20, x0
0x0000000181521bb4 C80C10F0               adrp       x8, #0x1a16bc000
0x0000000181521bb8 084140F9               ldr        x8, [x8, #0x80]            ; -[_CFXPreferences init]_1a16bc080
0x0000000181521bbc 080140F9               ldr        x8, [x8]
0x0000000181521bc0 292013F0               adrp       x9, #0x1a7928000
0x0000000181521bc4 29E90791               add        x9, x9, #0x1fa             ; ___CF120290
0x0000000181521bc8 A8831DF8               stur       x8, [x29, #-0x28]
0x0000000181521bcc E8030032               orr        w8, wzr, #0x1
0x0000000181521bd0 28010039               strb       w8, [x9]                   ; ___CF120290
0x0000000181521bd4 E8731290               adrp       x8, #0x1a639d000
0x0000000181521bd8 08F13F91               add        x8, x8, #0xffc             ; ___CF120293
0x0000000181521bdc 08014039               ldrb       w8, [x8]                   ; ___CF120293
0x0000000181521be0 48000034               cbz        w8, loc_181521be8

0x0000000181521be4 E3560394               bl         ___THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__

                                      loc_181521be8:
0x0000000181521be8 93420091               add        x19, x20, #0x10            ; CODE XREF=_CFRunLoopWakeUp+68
0x0000000181521bec E00313AA               mov        x0, x19
0x0000000181521bf0 70300694               bl         imp___stubs_-[NSOrderedSet sortedArrayFromRange:options:usingComparator:] // system libraries on device are obfuscated; this is actually __CFRunLoopLock
0x0000000181521bf4 882E40F9               ldr        x8, [x20, #0x58]
0x0000000181521bf8 080D40B9               ldr        w8, [x8, #0xc]
0x0000000181521bfc A8010034               cbz        w8, loc_181521c30

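The crash log places the fault at CFRunLoopWakeUp + 92, which corresponds to address 0x0000000181521b9c + 92 = 0x0000000181521bf8 in the disassembly, i.e. the instruction ldr w8, [x8, #0xc]. The register dump shows x8: 0x8c8c8c8c8c8c8c8c, which is not a valid pointer, so the memory it was loaded from had evidently been freed. x8 comes from ldr x8, [x20, #0x58] (the value stored at x20 plus offset 0x58), and x20 comes from mov x20, x0. x0 is the first argument of CFRunLoopWakeUp, the CFRunLoopRef, so x8 is the field at offset 0x58 in the CFRunLoop struct.

CoreFoundation is open source and can be downloaded here: CF-1153.18

The corresponding source of CFRunLoopWakeUp: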

void CFRunLoopWakeUp(CFRunLoopRef rl) {
    CHECK_FOR_FORK();
    __CFRunLoopLock(rl);
    if (__CFRunLoopIsIgnoringWakeUps(rl)) {
        __CFRunLoopUnlock(rl);
        return;
    }
    kern_return_t ret;
    ret = __CFSendTrivialMachMessage(rl->_wakeUpPort, 0, MACH_SEND_TIMEOUT, 0);
    if (ret != MACH_MSG_SUCCESS && ret != MACH_SEND_TIMED_OUT) CRASH("*** Unable to send message to wake up port. (%d) ***", ret);
    __CFRunLoopUnlock(rl);
}

CF_INLINE Boolean __CFRunLoopIsIgnoringWakeUps(CFRunLoopRef rl) {
    return (rl->_perRunData->ignoreWakeUps) ? true : false;    
}

The CFRunLoop struct:

struct __CFRunLoop {
    CFRuntimeBase _base;	// 16 bytes
    pthread_mutex_t _lock;	// 64 bytes
    __CFPort _wakeUpPort;	// mach_port_t (unsigned int), 4 bytes
    Boolean _unused;	// 1 byte, followed by 3 bytes of padding so that the next pointer is 8-byte aligned
    volatile _per_run_data *_perRunData;
    pthread_t _pthread;
    uint32_t _winthread;
    CFMutableSetRef _commonModes;
    CFMutableSetRef _commonModeItems;
    CFRunLoopModeRef _currentMode;
    CFMutableSetRef _modes;
    struct _block_item *_blocks_head;
    struct _block_item *_blocks_tail;
    CFAbsoluteTime _runTime;
    CFAbsoluteTime _sleepTime;
    CFTypeRef _counterpart;
};

typedef struct __CFRuntimeBase {
    uintptr_t _cfisa;	//unsigned long 8 byte
    uint8_t _cfinfo[4];	//unsigned char 4 byte
#if __LP64__
    uint32_t _rc;	//unsigned int 4 byte
#endif
} CFRuntimeBase;

struct pthread_mutex_t {
	long __sig;	//8 byte
	char __opaque[56]; //56 byte
};

Adding up the field sizes shows that ldr x8, [x20, #0x58] loads runloop->_perRunData. In other words, by the time __CFRunLoopIsIgnoringWakeUps was called, the CFRunLoopRef had already been deallocated.
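
The two pieces of arithmetic above can be checked with a small C sketch. The Mock* types are hypothetical stand-ins that copy only the field sizes quoted in the listings; they are not the real CoreFoundation definitions, and the result assumes a 64-bit (LP64) target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins mirroring only the sizes quoted above (LP64 assumed). */
typedef struct {
    uintptr_t cfisa;      /* 8 bytes */
    uint8_t   cfinfo[4];  /* 4 bytes */
    uint32_t  rc;         /* 4 bytes */
} MockRuntimeBase;        /* 16 bytes in total */

typedef struct {
    int64_t sig;          /* 8 bytes */
    char    opaque[56];   /* 56 bytes */
} MockMutex;              /* 64 bytes in total */

typedef struct {
    MockRuntimeBase base;        /* offset 0x00 */
    MockMutex       lock;        /* offset 0x10 */
    uint32_t        wakeUpPort;  /* offset 0x50 */
    uint8_t         unused;      /* offset 0x54, then 3 bytes of padding */
    void           *perRunData;  /* offset 0x58 on a 64-bit target */
} MockRunLoop;

/* Offset of the field that `ldr x8, [x20, #0x58]` reads. */
size_t mock_per_run_data_offset(void) {
    return offsetof(MockRunLoop, perRunData);
}

/* Address of the faulting instruction: function start plus 92 bytes. */
uint64_t faulting_address(uint64_t function_start) {
    return function_start + 92;
}

/* Runs both checks; aborts if the assumptions do not hold on this target. */
void verify_layout(void) {
    assert(sizeof(void *) == 8);
    assert(mock_per_run_data_offset() == 0x58);
    assert(faulting_address(0x181521b9cULL) == 0x181521bf8ULL);
}
```

Calling verify_layout() once is enough to confirm that the offset and the faulting address match the values read from the crash log.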

Analyzing the CFSocket source

The source of CFSocketInvalidate:

void CFSocketInvalidate(CFSocketRef s) {
    CHECK_FOR_FORK();
    CFRetain(s);
    __CFLock(&__CFAllSocketsLock);
    __CFSocketLock(s);
    if (__CFSocketIsValid(s)) {
    
        // some code omitted...

        // take the run loop array out of the socket
        CFArrayRef runLoops = (CFArrayRef)CFRetain(s->_runLoops);
        // CFRunLoop release operation 1
        CFRelease(s->_runLoops);
        
        s->_runLoops = NULL;
        
        // some code omitted...
        
        __CFSocketUnlock(s);
        
        // Do this after the socket unlock to avoid deadlock (10462525)
        for (idx = CFArrayGetCount(runLoops); idx--;) {
            CFRunLoopWakeUp((CFRunLoopRef)CFArrayGetValueAtIndex(runLoops, idx));
        }
        // CFRunLoop release operation 3
        CFRelease(runLoops);

        // some code omitted...
    } else {
        __CFSocketUnlock(s);
    }
    __CFUnlock(&__CFAllSocketsLock);
    CFRelease(s);
}

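The only place where CFSocketInvalidate calls CFRunLoopWakeUp is the final loop over runLoops. But at that point the CFRunLoopRef is still in the array, which holds a strong reference to it, so how could it already be deallocated inside CFRunLoopWakeUp?

Note that the loop over runLoops runs outside the lock. That suggests CFSocket may not be managing its runLoops array properly, so that the array is released while it is being iterated. Judging from the comment "Do this after the socket unlock to avoid deadlock (10462525)", the loop was probably inside the lock originally and was moved out because it deadlocked. Apple's bug reports are not public; the only possibly related discussion I could find is here: bug #10462525

The most likely culprit is __CFSocketCancel. When a run loop stops, its sources are removed as well, and CFRunLoopRemoveSource invokes the cancel callback of a source0, which here is __CFSocketCancel: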

void CFRunLoopRemoveSource(CFRunLoopRef rl, CFRunLoopSourceRef rls, CFStringRef modeName) {
    CHECK_FOR_FORK();
    Boolean doVer0Callout = false, doRLSRelease = false;
    __CFRunLoopLock(rl);
    if (modeName == kCFRunLoopCommonModes) {
	// code omitted...
    } else {
	CFRunLoopModeRef rlm = __CFRunLoopFindMode(rl, modeName, false);
	if (NULL != rlm && ((NULL != rlm->_sources0 && CFSetContainsValue(rlm->_sources0, rls)) || (NULL != rlm->_sources1 && CFSetContainsValue(rlm->_sources1, rls)))) {
	    CFRetain(rls);
	    // code omitted...
	    if (0 == rls->_context.version0.version) {
	        if (NULL != rls->_context.version0.cancel) {
	            doVer0Callout = true;
	        }
	    }
	    doRLSRelease = true;
	}
        // code omitted...
    }
    __CFRunLoopUnlock(rl);
    if (doVer0Callout) {
        // although it looses some protection for the source, we have no choice but
        // to do this after unlocking the run loop and mode locks, to avoid deadlocks
        // where the source wants to take a lock which is already held in another
        // thread which is itself waiting for a run loop/mode lock
        rls->_context.version0.cancel(rls->_context.version0.info, rl, modeName);	/* CALLOUT */
    }
    if (doRLSRelease) CFRelease(rls);
}

The source of __CFSocketCancel:

static void __CFSocketCancel(void *info, CFRunLoopRef rl, CFStringRef mode) {
    CFSocketRef s = (CFSocketRef)info;
    __CFSocketLock(s);
    if (0 == s->_socketSetCount) {
        // code omitted...
    }
    if (NULL != s->_runLoops) {
        // remove this run loop from the array: copy the original array, then release the original
        CFMutableArrayRef runLoopsOrig = s->_runLoops;
        CFMutableArrayRef runLoopsCopy = CFArrayCreateMutableCopy(kCFAllocatorSystemDefault, 0, s->_runLoops);
        idx = CFArrayGetFirstIndexOfValue(runLoopsCopy, CFRangeMake(0, CFArrayGetCount(runLoopsCopy)), rl);
        if (0 <= idx) CFArrayRemoveValueAtIndex(runLoopsCopy, idx);
        s->_runLoops = runLoopsCopy;
        // CFRunLoop release operation 2
        CFRelease(runLoopsOrig);
    }
    __CFSocketUnlock(s);
}

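__CFSocketCancel also performs one release that affects the CFRunLoopRef (it releases the array that holds it). Together with the two in CFSocketInvalidate, that makes three release operations in total.

So if __CFSocketCancel and CFSocketInvalidate run concurrently on different threads, the runLoops array in the CFSocket can be over-released, and a CFRunLoopRef can then be deallocated while runLoops is being iterated. The probability of this crash is low, but in our project it recurs steadily every so often.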

So taking a lock does not settle everything. To be safe, CFSocketInvalidate should take one more retain before iterating over the array.
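
The "retain before using it outside the lock" idea can be sketched in plain C. This is only an illustration under stated assumptions: Obj, Slot and every function below are hypothetical names, with Obj standing in for a CFRunLoopRef and Slot for the socket's guarded runLoops storage. It is not how CoreFoundation is implemented.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical reference-counted object standing in for a CFRunLoopRef. */
typedef struct {
    atomic_int refcount;
} Obj;

static atomic_int live_objects = 0;   /* lets a caller observe deallocation */

Obj *obj_create(void) {
    Obj *o = malloc(sizeof *o);
    if (o == NULL) abort();
    atomic_init(&o->refcount, 1);
    atomic_fetch_add(&live_objects, 1);
    return o;
}

Obj *obj_retain(Obj *o) {
    atomic_fetch_add(&o->refcount, 1);
    return o;
}

void obj_release(Obj *o) {
    /* atomic_fetch_sub returns the previous value: 1 means this was the last reference. */
    if (atomic_fetch_sub(&o->refcount, 1) == 1) {
        atomic_fetch_sub(&live_objects, 1);
        free(o);
    }
}

int live_object_count(void) {
    return atomic_load(&live_objects);
}

/* Shared slot guarded by a mutex, standing in for the socket's runLoops storage. */
typedef struct {
    pthread_mutex_t lock;
    Obj *item;
} Slot;

/* The slot takes over the caller's reference to item. */
void slot_init(Slot *s, Obj *item) {
    pthread_mutex_init(&s->lock, NULL);
    s->item = item;
}

/* Take an extra reference while the lock is held; the caller must obj_release() it. */
Obj *slot_copy_item(Slot *s) {
    pthread_mutex_lock(&s->lock);
    Obj *o = s->item ? obj_retain(s->item) : NULL;
    pthread_mutex_unlock(&s->lock);
    return o;
}

/* What a concurrent teardown would do: drop the slot's own reference. */
void slot_clear(Slot *s) {
    pthread_mutex_lock(&s->lock);
    Obj *o = s->item;
    s->item = NULL;
    pthread_mutex_unlock(&s->lock);
    if (o) obj_release(o);
}
```

Because slot_copy_item takes its reference inside the lock, a slot_clear that runs afterwards on another thread only drops the slot's reference, and the object stays alive until the reader calls obj_release.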

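Solutions

  • Since this is a bug in CFSocket itself, the only option is to avoid code in which CFSocketInvalidate and CFRunLoopStop can run concurrently on different threads.
  • If the socket only runs on this thread, just call CFRunLoopStop; the run loop cleans up all of its sources automatically.
  • If the thread needs to be reused, do not stop it. Stop the socket instead and create a new socket on the same thread.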
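
One way to keep the two teardown calls from ever overlapping is to let other threads only post a stop request and have the owning thread perform both steps itself. The sketch below shows that shape with plain pthreads. Worker and its functions are hypothetical, the condition variable merely stands in for the run loop, and the two recorded steps mark where CFSocketInvalidate and CFRunLoopStop would go. In a real CoreFoundation program the request could presumably be delivered with CFRunLoopPerformBlock followed by CFRunLoopWakeUp instead.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

enum { STEP_INVALIDATE = 1, STEP_STOP = 2 };

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  wake;
    bool            stop_requested;
    pthread_t       thread;
    int             steps[2];     /* teardown steps in the order they ran */
    int             step_count;
} Worker;

static void *worker_main(void *arg) {
    Worker *w = arg;

    /* Stand-in for CFRunLoopRun(): sleep until a stop request arrives. */
    pthread_mutex_lock(&w->lock);
    while (!w->stop_requested) {
        pthread_cond_wait(&w->wake, &w->lock);
    }
    pthread_mutex_unlock(&w->lock);

    /* Both steps run back to back on the owning thread, never concurrently. */
    w->steps[w->step_count++] = STEP_INVALIDATE;  /* CFSocketInvalidate would go here */
    w->steps[w->step_count++] = STEP_STOP;        /* CFRunLoopStop would go here */
    return NULL;
}

/* Starts the worker thread; returns 0 on success. */
int worker_start(Worker *w) {
    assert(w != NULL);
    pthread_mutex_init(&w->lock, NULL);
    pthread_cond_init(&w->wake, NULL);
    w->stop_requested = false;
    w->step_count = 0;
    return pthread_create(&w->thread, NULL, worker_main, w);
}

/* Safe from any thread: it only posts a request and touches no socket state. */
void worker_request_stop(Worker *w) {
    pthread_mutex_lock(&w->lock);
    w->stop_requested = true;
    pthread_cond_signal(&w->wake);
    pthread_mutex_unlock(&w->lock);
}

/* Waits for the worker to finish its teardown; returns 0 on success. */
int worker_join(Worker *w) {
    return pthread_join(w->thread, NULL);
}
```

A caller would use worker_start, later worker_request_stop from whichever thread wants the shutdown, and then worker_join. The ordering of the two steps is fixed by the worker's own code, so no lock is needed to keep them apart.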

A run loop that stops by itself

So if the stop code is changed to the following, everything should be fine, right?

- (void)runThread {
    @autoreleasepool {
        self.currentRunloop = CFRunLoopGetCurrent();
        [self addRunloopSource];
        [self addSocketSource];
        
        CFRunLoopRun();
    }
    NSLog(@"Thread exited");
}

- (void)stopThread {
    if (_currentRunloop) {
        // removeSocketSource is guaranteed to run only here, never concurrently
        [self removeSocketSource];
        CFRunLoopStop(_currentRunloop);
        self.currentRunloop = NULL;
    }
}

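Unfortunately, this is still unsafe.

The reason is that after removeSocketSource the run loop has no sources left. When a run loop finds that it has no sources, it exits its loop on its own and the thread is torn down.

So if stopThread is called from another thread, the worker thread may exit at any moment after removeSocketSource, and the run loop may already have been deallocated by the time CFRunLoopStop is called.

With the code above the crash is very unlikely, but a small change makes it reproduce every time: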

- (void)stopThread {
    if (_currentRunloop) {
        [self removeSocketSource];
        
        // insert a slow operation
        sleep(2);
        // crashes every time
        CFRunLoopStop(_currentRunloop);
        self.currentRunloop = NULL;
    }
}

The real cause of this crash is poor memory management. Retaining the run loop once is enough to fix it:

- (void)runThread {
    @autoreleasepool {
        // retain the run loop once; CFRetain returns CFTypeRef, so cast it back
        self.currentRunloop = (CFRunLoopRef)CFRetain(CFRunLoopGetCurrent());
        [self addRunloopSource];
        [self addSocketSource];
        
        CFRunLoopRun();
    }
    NSLog(@"Thread exited");
}

- (void)stopThread {
    if (_currentRunloop) {
        [self removeSocketSource];
        CFRunLoopStop(_currentRunloop);
        CFRelease(_currentRunloop);
        self.currentRunloop = NULL;
    }
}

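Conclusion

Be careful when using run loop sources, especially during the stop phase. Other source types may have similar problems.

When a variable is accessed from multiple threads, operating on it outside the lock is unsafe even for a read. It is best to retain it once more before reading, so that it cannot be deallocated mid-read.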