记一次 Python Popen 阻塞的问题在一次更新中，我们的 Python Agent 增加了一个新的子进程。上线后

640?wx_fmt=jpeg

Valens

网易游戏运维与基础架构部产品软件设计师专注于智能监控、性能优化等领域

背景

在一次更新中，我们的 Python Agent 增加了一个新的子进程。上线后发现，主进程中，监控该子进程的插件线程小概率会被阻塞，阻塞点的调用函数为subprocess.Popen，逻辑为跑一个ss命令采集本地连接情况。

现场

拿到一个保留现场后，简单查看下，父进程子进程状态都比较正常。二话不说先用 Pylane Shell 打进去，看看阻塞的线程在干嘛：

 1In [25]: tools.print_threads(["pusher_keep_alive"])
 2Out[25]:
 3
 4Thread: pusher_keep_alive
 5Stack:
 6......
 7  File "/usr/local/gpostd_virtualenv/local/lib/python2.7/site-packages/libase/multirun/common.py", line 237, in run_cmdline_with_file
 8    env=env,
 9  File "/usr/local/gpostd_virtualenv/local/lib/python2.7/site-packages/subprocess32.py", line 825, in __init__
10    restore_signals, start_new_session)
11  File "/usr/local/gpostd_virtualenv/local/lib/python2.7/site-packages/subprocess32.py", line 1537, in _execute_child
12    part = _eintr_retry_call(os.read, errpipe_read, 50000)
13  File "/usr/local/gpostd_virtualenv/local/lib/python2.7/site-packages/subprocess32.py", line 546, in _eintr_retry_call
14    return func(*args)
15
16Locals:
17 {'args': (18, 50000), 'func': <built-in function read>}
18

卡在了 subprocess32 库调用 os.read 的地方，关键代码：

 11330         def _execute_child(self, args, executable, preexec_fn, close_fds,
 2......
 31356             errpipe_read, errpipe_write = _create_pipe() # 创建一个pipe，拿到读写fd
 4......
 51358                 try:
 6......
 71390                         self.pid = _posixsubprocess.fork_exec( # 在fork_exec时使用这个pipe来接受err信息
 8......
 91395                                 errpipe_read, errpipe_write,
101396                                 restore_signals, start_new_session, preexec_fn)
11......
12
131518                 finally:   # 一个大try...catch处理中间的逻辑，最后finally确保关闭pipe的write fd
141519                     # be sure the FD is closed no matter what
151520                     os.close(errpipe_write)
16......
171536                 while True: # 读完pipe，读到空字符串或者超出bufsize后退出 (小声吐槽这个裸写的50000)
181537                     part = _eintr_retry_call(os.read, errpipe_read, 50000)
191538                     errpipe_data += part
201539                     if not part or len(errpipe_data) > 50000:
211540                         break

至于_eintr_retry_call没什么特别的，就是忽略 inpterrupt 调 os.read，返回结果而已。重点关注对 os.read pipe 的逻辑。

这个现场就很迷了，subprocess 是很经典的老库了，按理不会有什么奇怪问题。

分析

Subprocess 使用 pipe 来对接 Popen 子进程的输出，本次阻塞就出在读该 pipe 的地方。

回顾一下基本知识，pipe 本质上是一个带 buffer 的单项字符流，python 下的使用很简单，fdr, fdw = os.pipe()拿到 fds，然后在 fdw 写，在 fdr 读。

当 pipe 写端打开，buffer 为空时，读端os.read(fdr, size)阻塞。当os.close(fdw)关掉写端、buffer 空时，读端os.read(fdr, size)返回空字符串，通常，我们也会以此作为读端判断 pipe 关闭的标志。

那么，在 subprocess32 上述 1520 行关闭errpipe_write后，1536 行开始的循环会读完剩下的 buffer，然后遇到空字符串，这段逻辑就结束了。然而现在却卡在了os.read。

按以上事实，首先怀疑 fdw 没有 close 掉，要验证这一点，首先得拿到这个 fd，但问题是，这些变量都是 local 的，要拿到它们，必须先拿到对应的 frame，继续 Pylane Shell：

 1In [2]: import threading
 2In [3]: import sys
 3In [4]: threads = threading.enumerate()
 4In [5]: frames = sys._current_frames()
 5In [6]: for thread in threads:
 6   ...:     if thread.name == "pusher_keep_alive":
 7   ...:         break
 8   ...: else:
 9   ...:     ......
10In [7]: f = frames.get(thread.ident)

这段代码拿到了所有线程和所有 frames，然后找到阻塞的线程，用他的 ident 索引拿到对应的 frame

1In [8]: f.f_code
2Out[8]:
3<code object _eintr_retry_call at 0x7f82d6963a30, file "/usr/local/gpostd_virtualenv/local/lib/python2.7/site-packages/subprocess32.py", line 543>

按之前看过的代码，_eintr_retry_call没什么用，我们要看的是它的上一层。所以用 f_back 来 up 到上一层 frame：

1In [9]: ff = f.f_back
2In [10]: ff.f_code
3Out[10]:
4<code object _execute_child at 0x7f82d696ddb0, file "/usr/local/gpostd_virtualenv/local/lib/python2.7/site-packages/subprocess32.py", line 1330>

然后，我们可以访问这个 frame 的 locals，拿到这两个 fd 了：

1In [28]: ff.f_locals
2Out[28]:
3{......, 'errpipe_write': 19, ......, 'errpipe_read': 18, ......}

先尝试下os.read(errpipe_read, 16)，不出所料阻塞了，症结在此。lsof看看这两个 fd 到底被谁持有：

1lsof -d 18
2python2.7   912    root   18r     FIFO               0,10      0t0  20491606 pipe
3python2.7 25189    root   18r     FIFO               0,10      0t0  20491606 pipe
4
5lsof -d 19
6python2.7 25189    root   19w     FIFO               0,10      0t0  20491606 pipe

忽略其他的 fd，只看 FIFO 的，进程 912 和 25189 拿到了 18r，进程 25189 拿到了 19w。（r、w 即 read、write)

912 是 Python Agent 的主进程，25189 则是它 fork 的子进程。看到这，真相已昭然若揭了：

主进程的状态检查线程在执行这次 Popen，打开 18r 和 19w 这两个 fds 时，正好发生线程切换；
另外一个线程 fork 了子进程，18r 和 19w 理所当然地被子进程继承；
主进程切换回状态检查线程，继续执行 Popen 逻辑，关闭掉主进程持有的 19w，但子进程并未关闭 19w，所以这个 FIFO Pipe 并没有关闭；
最终，主进程的状态检查线程在 read 该 fd 时，本来应该在 fd 关闭后继续，但 fd 并没有关闭，所以线程阻塞。

实锤

我们可以写一个简单的例子来复现这个问题：

 1# -*- coding: utf-8 -*-
 2
 3import time
 4import os
 5from multiprocessing import Process
 6
 7print "pid", os.getpid()
 8
 9fdw, fdr = 0, 0
10
11
12def case_good():
13    """正常情况下，close掉pipe的fdw，fdr会读到''空字符串"""
14    print "case good"
15    fdr, fdw = os.pipe()
16    print "fdr", fdr, "fdw", fdw
17    os.close(fdw)
18    print "fdw closed, prepare to read"
19    r = os.read(fdr, 10)
20    if r == '':
21        print "fdr closed"
22    else:
23        print "impossible"
24
25
26def sub_process():
27    """子进程，sleep10秒后close fdw"""
28    time.sleep(10)
29    os.close(fdw)
30
31
32def case_bad():
33    """在持有fd时fork子进程，复现问题"""
34    print "case bad"
35    global fdw, fdr
36    # 父进程拿到pipe fd
37    fdr, fdw = os.pipe()
38    print "fdr", fdr, "fdw", fdw
39    # fork子进程
40    p = Process(target=sub_process, args=())
41    p.start()
42    print "sub process start", p.pid
43    time.sleep(2)
44
45    # 父进程close fdw
46    os.close(fdw)
47    print "fdw closed, prepare to read"
48    # 子进程也持有了fdw，所以上一句close并未关掉fdw，于是read会阻塞，直到子进程close fdw
49    r = os.read(fdr, 10)
50    if r == '':
51        print "fdr closed"
52    else:
53        print "impossible"
54
55
56case_good()
57case_bad()

运行它：

 1pid 77718
 2case good
 3fdr 3 fdw 4
 4fdw closed, prepare to read
 5fdr closed
 6
 7case bad
 8fdr 4 fdw 5
 9sub process start 77719
10fdw closed, prepare to read
11
12# 阻塞10秒后，子进程放开了那个fd，主进程终于可以继续了
13
14fdr closed

结论

这是一个比较典型的 bug，在非预期的上下文中 fork，让子进程继承了不必要的资源，总会引发各种莫名其妙的问题。改进的建议是，梳理清楚线程、子进程和其他功能的依赖，合理规划启动顺序，在尽可能干净的情况下 fork。

本文涉及的知识点扩展阅读：

pipe:
- www.ibm.com/developerwo…
- docs.python.org/3/library/o…
linux fork:
- man7.org/linux/man-p…
python stack frame:
- searchframe:****
  
  ****docs.python.org/3/reference…
- book.douban.com/subject/311…

▼ 点击成为社区注册会员 「在看」 一下，一起PY！