Parallel and High Performance Programming with Python: Using Multiprocessing and the mpi4py Library

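In this chapter, we begin working with parallel programming. In Python, only processes can run truly simultaneously and so deliver the real benefits of parallel computing. There are two main ways to program with parallel processes in Python: the multiprocessing module of the standard library, and the mpi4py library, which brings the MPI protocol to Python.

The multiprocessing module implements the shared memory programming paradigm: the processes of a program can access a common memory area. As with threads, this can lead to race conditions and synchronization problems. To deal with them, we will look at the two interprocess communication channels that the multiprocessing module provides: Queue and Pipe. They let parallel processes exchange data in a synchronized way and avoid the problems of shared memory.

The mpi4py library, on the other hand, is based on the message passing paradigm. In this model there are no shared resources, and all communication between processes takes place through the exchange of messages.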

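Structure

In this chapter, we will cover the following topics:

The multiprocessing module
Processes as classes and subclasses
Communication channels: Queue and Pipe
Process pools
ProcessPoolExecutor
The mpi4py library
Point-to-point communication
Collective communication
Topologies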

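Processes and the `multiprocessing` module

In Python, the real protagonists of parallel programming are processes, because only processes allow code to execute truly simultaneously. As we saw in the previous chapters, Python threads cannot run in parallel; at best they run concurrently.

Programming with processes is therefore so important in Python that the standard library dedicates a whole module to it: multiprocessing.

The module provides the operations needed to create and manage the processes of a program, and it is used in much the same way as the threading module is used for threads. The instructions and constructs involved are almost identical:

Create a process with the Process() constructor.
Call the start() method to start the activity of the process.
Call the join() method to make the main program (the main process) wait until the process has finished; calling it on every process started in parallel makes the main process wait for all of them.

The life cycle of a process is based on three states:

Ready
Running
Waiting

The multiprocessing module organizes parallel programming as follows. The main process is the one that exists when the program starts. Within it, a number of processes are defined with the Process() constructor. At this point the newly defined processes are in the ready state. Then a part of the program begins to execute in parallel: each child process is activated by calling its start() method. From then on, the main process (also called the parent process) continues asynchronously, without waiting for the child processes to finish (see Figure 3.1).

If we need a synchronization mechanism and want the main process to wait for the results of its children, we can call the join() method on each child process (see Figure 3.2):

We write code for processes in much the same way as we did for threads with the threading module. As you can see, the two APIs were designed to be consistent with each other.
Each process activated in parallel performs the operations specified by its target function. All processes can perform the same operations, or each can perform different ones, because the target function is specified separately for each process through the target parameter of the constructor.

To better understand these concepts, let us look at the following example code: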

import multiprocessing
import time

def function(i):
    print("start Process %i" % i)
    time.sleep(2)
    print("end Process %i" % i)
    return

if __name__ == '__main__':
    p1 = multiprocessing.Process(target=function, args=(1,))
    p2 = multiprocessing.Process(target=function, args=(2,))
    p3 = multiprocessing.Process(target=function, args=(3,))
    p4 = multiprocessing.Process(target=function, args=(4,))
    p5 = multiprocessing.Process(target=function, args=(5,))

    p1.start()
    p2.start()
    p3.start()
    p4.start()
    p5.start()

    p1.join()
    p2.join()
    p3.join()
    p4.join()
    p5.join()

    print("END Program")

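As the code shows, five parallel processes are defined with the Process() constructor. They all execute the same target function, which for simplicity is called function. If the function takes arguments, they are passed as a tuple through the args parameter of the Process() constructor. Here we pass a number that identifies, inside the function, which process is running. All the processes are started with the start() method and then synchronized with the main process through the join() method.

Running the code gives a result similar to the following: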

start Process 1
start Process 2
start Process 3
start Process 4
start Process 5
end Process 1
end Process 2
end Process 4
end Process 3
end Process 5
END Program

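The result looks very similar to the one obtained with threads and the threading module, but here the five processes really do work in parallel.

Writing the parallel code this way is very clear, because every instruction is spelled out individually. Since five processes are needed, there are five constructor lines, five start() lines to launch the processes, and five join() lines to synchronize them with the main process. With a larger number of processes, writing code this way quickly becomes tedious.

A more compact approach is to generalize the previous steps with loops. The modified code looks like this: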

import multiprocessing
import time

def function(i):
    print("start Process %i" % i)
    time.sleep(2)
    print("end Process %i" % i)
    return

if __name__ == '__main__':
    processes = []
    n_procs = 5
    for i in range(n_procs):
        p = multiprocessing.Process(target=function, args=(i,))
        processes.append(p)
        p.start()

    for i in range(n_procs):
        processes[i].join()

    print("END Program")

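Running the modified code gives the same behaviour as before. The only difference is that the processes are now numbered 0 to 4, because the loop index starts at 0.

Using the process ID (PID)

In the previous examples, we identified the different processes by passing the loop index i as an argument through the Process() constructor. Another way to tell running processes apart is to use their process ID (PID). This is straightforward: we import the os module from the standard library and call its getpid() function, which returns the PID of the process that calls it.

We modify the previous code as follows, replacing the number with the process ID: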

import multiprocessing
import os
import time

def function():
    pid = os.getpid()
    print("start Process %s" % pid)
    time.sleep(2)
    print("end Process %s" % pid)
    return

if __name__ == '__main__':
    processes = []
    n_procs = 5
    for i in range(n_procs):
        p = multiprocessing.Process(target=function)
        processes.append(p)
        p.start()
    for i in range(n_procs):
        processes[i].join()
    print("END Program")

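As you can see, the Process() constructor no longer needs the args parameter, because the function calls getpid() itself to obtain the PID of the current process. When the same function is run by five processes, it therefore reports five different PIDs, with no argument passed from the main process.

Running the code produces output similar to the following: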

start Process 20644
start Process 20000
start Process 16240
start Process 1988
start Process 24388
end Process 20644
end Process 20000
end Process 16240
end Process 1988
end Process 24388
END Program

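This time the output shows not increasing numbers but the PIDs of the processes, the unique identifiers that the operating system assigns to each process. With these PIDs you can monitor the resource usage and state of each process with other system tools, independently of the Python interpreter.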
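The PID is also available to the parent through the Process object itself. The following is a minimal sketch of that idea: `describe` is a hypothetical helper (not part of multiprocessing) that collects the documented `name`, `pid`, `exitcode` attributes and the `is_alive()` result, and `time.sleep` merely stands in for real work.

```python
import multiprocessing
import time

def describe(proc):
    """Collect the bookkeeping attributes the parent can read from a Process."""
    return {
        "name": proc.name,
        "pid": proc.pid,            # None until start() has been called
        "alive": proc.is_alive(),
        "exitcode": proc.exitcode,  # None while the process has not finished
    }

if __name__ == "__main__":
    # time.sleep is used as a stand-in target; any picklable callable would do
    p = multiprocessing.Process(target=time.sleep, args=(1,), name="worker-0")
    print("before start:", describe(p))
    p.start()
    print("running:     ", describe(p))
    p.join()
    print("after join:  ", describe(p))
```

The PID printed while the process is running is the same value that `os.getpid()` returns inside the child, so it can be matched against what system tools report.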

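Process pool

A further development in managing multiple processes is the process pool, implemented in Python by the multiprocessing.Pool class.

A process pool is an object that manages a fixed number of worker processes. It controls their life cycle: it creates them, gives them work to run, and lets idle workers wait until a task is assigned to them. multiprocessing.Pool offers an interface that lets us concentrate on the tasks to be carried out, without worrying about how many processes are used or which one executes a given task. This greatly simplifies the code and makes it easier to read.

Let us rewrite the previous code with multiprocessing.Pool: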

import multiprocessing
import time

def function(i):
    process = multiprocessing.current_process()
    print("start Process %i(pid:%s)" % (i, process.pid))
    time.sleep(2)
    print("end Process %i(pid:%s)" % (i, process.pid))
    return

if __name__ == '__main__':
    pool = multiprocessing.Pool()
    print("Processes started: %s" % pool._processes)
    for i in range(pool._processes):
        results = pool.apply(function, args=(i,))
    pool.close()
    print("END Program")

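A few things have changed in the code:

pool = multiprocessing.Pool() creates a process pool. By default the number of worker processes equals the number of logical CPU cores of the system; the example reads it from pool._processes, which is a private attribute used here only for illustration.
pool.apply() asks the pool to execute function(), passing the task number as its argument.
When the tasks are done, pool.close() is called to close the pool so that it accepts no further tasks.
Inside function(), multiprocessing.current_process() returns the current process object, whose pid attribute gives the process ID.

Running the code gives a result like this: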

Processes started: 12
start Process 0(pid:18196)
end Process 0(pid:18196)
start Process 1(pid:5300)
end Process 1(pid:5300)
...
start Process 11(pid:16500)
end Process 11(pid:16500)
END Program

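As you can see, 12 worker processes were started, matching the number of logical CPU cores of the machine (this differs from computer to computer). Note, however, that the tasks are executed one after another: pool.apply() blocks until its task has finished, so the next task is submitted only afterwards. Later we will see how to run them in parallel with map().

Limiting the number of processes in the pool

When calling Pool() you can also request a fixed number of processes, not necessarily equal to the number of CPU cores, through the processes parameter:

pool = multiprocessing.Pool(processes=4)

We modify the code as follows, setting the pool size to 4 and the number of tasks to 12: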

import multiprocessing
import time

def function(i):
    process = multiprocessing.current_process()
    print("start Task %i(pid:%s)" % (i, process.pid))
    time.sleep(2)
    print("end Task %i(pid:%s)" % (i, process.pid))
    return

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    print("Processes started: %s" % pool._processes)
    for i in range(12):
        results = pool.apply(function, args=(i,))
    pool.close()
    print("END Program")

Running it gives a result similar to the following:

Processes started: 4
start Task 0(pid:5284)
end Task 0(pid:5284)
start Task 1(pid:2220)
end Task 1(pid:2220)
...
start Task 11(pid:4692)
end Task 11(pid:4692)
END Program

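As you can see, there are 12 tasks but only 4 processes, which take turns executing all the tasks. Execution is still sequential, because each pool.apply() call waits for its task to finish.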
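One way to let the pool work on several tasks at once, while keeping the same per-task call style, is the asynchronous counterpart `apply_async()`. The sketch below is an illustration under simple assumptions: `task` is a hypothetical stand-in for real work and the half-second sleep is arbitrary.

```python
import multiprocessing
import time

def task(i):
    """Simulate half a second of work and return a value."""
    time.sleep(0.5)
    return i * i

if __name__ == "__main__":
    start = time.perf_counter()
    with multiprocessing.Pool(processes=4) as pool:
        # apply_async() returns immediately with an AsyncResult handle,
        # so all four tasks are submitted before any of them has finished
        handles = [pool.apply_async(task, args=(i,)) for i in range(4)]
        # get() blocks until that particular task has produced its result
        values = [h.get() for h in handles]
    print("values:", values)
    print("elapsed: %.1f s" % (time.perf_counter() - start))
```

With four workers and four tasks, the elapsed time should be close to the duration of a single task rather than four times as long, although the exact figure depends on the machine and on process start-up cost.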

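Defining processes as subclasses

So far we have defined the processes to run in parallel with the Process() constructor, passing a target function that specifies the code to execute.
Another way to implement parallel processes is to define them as subclasses of Process. In the subclass, the specific behaviour is implemented by overriding the __init__() and run() methods. The code to be run in parallel goes into the run() method of the class.

The following example code makes this clearer: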

from multiprocessing import Process
import time
import random

class ChildProcess(Process):
    def __init__(self, count):
        Process.__init__(self)
        self.count = count

    def run(self):
        print("start Process %s" % self.count)
        time.sleep(2)
        print("end Process %s" % self.count)

if __name__ == '__main__':
    processes = []
    n_procs = 5
    for i in range(n_procs):
        p = ChildProcess(i)
        processes.append(p)
        p.start()

    for i in range(n_procs):
        processes[i].join()

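As the code shows, we defined ChildProcess as a subclass of Process.
Two methods are overridden in the class:

__init__() defines the attributes of the subclass, here the count parameter it receives; it must also call Process.__init__() so that the base class is set up properly;
run() contains the code that previously lived in the target function, that is, the code executed when the process starts.

The result is similar to the following: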

start Process 0
start Process 1
start Process 2
start Process 3
start Process 4
end Process 0
end Process 1
end Process 2
end Process 4
end Process 3

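Communication channels between processes

Unlike threads, processes do not normally share a memory space, so different mechanisms are needed for them to communicate and exchange data. The multiprocessing module is based on the shared memory paradigm and does allow resources to be shared between processes, but doing so brings back the race conditions and synchronization problems known from threads, and is therefore strongly discouraged.

The multiprocessing module provides synchronization tools similar to those of the threading module, such as Semaphore, Lock, and Event, but it is preferable to use dedicated communication channels to exchange data safely and in a synchronized way. With processes, shared memory and locks are more complex to handle than with threads, whereas a communication channel passes data between processes more safely.

There are two main kinds of communication channel:

Queue
Pipe

Both are implemented by classes of the multiprocessing module, Queue and Pipe. Each provides a constructor (Queue() and Pipe()) and a set of methods for managing data exchange and synchronization. A channel is usually created in the main process and acts as a bridge between child processes, so that data is exchanged safely and in order, without the risk of race conditions. If more data is sent than is currently being requested, nothing is overwritten: the data is buffered inside the channel.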

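Queue

multiprocessing.Queue is the preferred data structure for communication between processes. Note that Python also has a queue module (commonly used between threads) whose Queue class offers a similar interface, but the two are implemented differently: queue.Queue lives inside a single process, whereas multiprocessing.Queue serializes each object and sends it through a pipe, protected internally by a few locks and semaphores, so that several processes can share it.

A queue is a first-in, first-out (FIFO) data structure. A process adds data with the put() method and retrieves it with the get() method, and items come out in the order in which they were put in. A queue can hold simple values as well as complex objects, as long as they can be pickled.

By default the queue has no size limit. If you want to limit its capacity, specify the maxsize parameter when creating it: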

queue = multiprocessing.Queue(maxsize=100)

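At run time it is sometimes useful to know how many objects are currently in the queue, which is what the qsize() method reports. The value is only approximate, because other processes may be adding or removing items at the same moment, and on platforms where the underlying system call is not implemented (macOS, for example) the method raises NotImplementedError: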

size = queue.qsize()

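You can also check whether the queue is empty or full with the empty() and full() methods. Because other processes may put or get items at any moment, the answer is only a snapshot:

if queue.empty():
    print("the queue is empty")
if queue.full():
    print("the queue is full")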
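Since the checks above can be out of date by the time they are acted upon, a common alternative is to attempt the operation and handle the exception. The following is a small sketch of that approach; `try_put` and `try_get` are hypothetical helper names, while `put_nowait()`, `get(timeout=...)` and the `queue.Full` / `queue.Empty` exceptions are the documented interface of multiprocessing.Queue.

```python
import multiprocessing
import queue  # only needed for the Full and Empty exception classes

def try_put(q, item):
    """Put an item without blocking; report whether there was room."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False

def try_get(q, timeout=1.0):
    """Wait up to `timeout` seconds for an item; return None if none arrives."""
    try:
        return q.get(timeout=timeout)
    except queue.Empty:
        return None

if __name__ == "__main__":
    q = multiprocessing.Queue(maxsize=2)
    # The third put is rejected because the queue holds at most two items
    print([try_put(q, n) for n in (1, 2, 3)])
    # Two items come back in FIFO order, then the queue is empty
    print(try_get(q), try_get(q), try_get(q, timeout=0.2))
```

Returning None for "nothing arrived" is a simplification for this sketch; it would be ambiguous in a program where None is itself a valid item.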

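A producer-consumer example

The producer-consumer pattern is a good way to see how a queue works. A producer process generates data and puts it into the queue; a consumer process takes data from the queue and uses it. Producer and consumer generally run asynchronously and exchange data only through the queue. The producer may generate data faster than the consumer can use it; in that case the queue buffers the data and nothing is lost.

The following code is a simple example of two processes, a producer and a consumer, communicating through a queue: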

from multiprocessing import Process, Queue
import time
import random

class Consumer(Process):
    def __init__(self, count, queue):
        Process.__init__(self)
        self.count = count
        self.queue = queue

    def run(self):
        for i in range(self.count):
            local = self.queue.get()
            time.sleep(2)
            print("consumer has used this: %s" % local)

class Producer(Process):
    def __init__(self, count, queue):
        Process.__init__(self)
        self.count = count
        self.queue = queue

    def request(self):
        time.sleep(1)
        return random.randint(0, 100)

    def run(self):
        for i in range(self.count):
            local = self.request()
            self.queue.put(local)
            print("producer has loaded this: %s" % local)

if __name__ == '__main__':
    queue = Queue()
    count = 5
    p1 = Producer(count, queue)
    p2 = Consumer(count, queue)
    p1.start()
    p2.start()
    p1.join()
    p2.join()

Sample output:

producer has loaded this: 55
producer has loaded this: 30
consumer has used this: 55
producer has loaded this: 60
producer has loaded this: 14
consumer has used this: 30
producer has loaded this: 97
consumer has used this: 60
consumer has used this: 14
consumer has used this: 97

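The output shows that the data generated by the producer is put into the queue item by item, and that the consumer uses it in the same order. No item is lost or overwritten.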

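Pipes

Another class used as a communication channel between processes is Pipe. A pipe connects two processes and is designed for sending and receiving data between them. Like Queue, it has its own constructor, Pipe(), which returns a pair of multiprocessing.connection.Connection objects, one for each end of the pipe. The two connections are usually unpacked directly from the call:

conn1, conn2 = multiprocessing.Pipe()

By default the pipe is bidirectional (duplex=True): each connection can both send and receive, and whatever is sent on one end is received on the other.

A pipe can also be created in one-way mode by setting the duplex parameter to False in the constructor. In that case the first connection (conn1) can only be used for receiving and the second (conn2) only for sending:

conn1, conn2 = multiprocessing.Pipe(duplex=False)

In either mode, processes send data with the send() method and receive it with the recv() method of the Connection objects: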

conn2.send(object)
object = conn1.recv()

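As with Queue, the only requirement on the objects sent is that they can be serialized (pickled). Serialization happens automatically on sending and deserialization on receiving. Objects of ordinary size pose no problem, but according to the multiprocessing documentation very large pickles (on the order of tens of megabytes, depending on the operating system) may raise a ValueError.

The connection objects also offer some management methods, such as poll(), which returns a Boolean indicating whether there is data waiting to be received:

if conn1.poll():
    print("there is data to read")

This method is useful for controlling the flow of a program, because it lets a process check for data without blocking in recv().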
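As a minimal sketch of that flow control, the helper below reads from a connection only when poll() reports that data is waiting. `recv_if_ready` is a hypothetical name introduced for illustration; both ends of the pipe are used from a single process here purely to keep the example short, and the pipe is created with `duplex=False` so that the first connection is the receiving end.

```python
import multiprocessing

def recv_if_ready(conn, timeout=0.0):
    """Return the next object if one arrives within `timeout` seconds, else None."""
    if conn.poll(timeout):   # poll() accepts an optional timeout in seconds
        return conn.recv()
    return None

if __name__ == "__main__":
    receiver, sender = multiprocessing.Pipe(duplex=False)
    print(recv_if_ready(receiver))               # nothing has been sent yet
    sender.send({"value": 42})
    print(recv_if_ready(receiver, timeout=1.0))  # the dictionary sent above
    receiver.close()
    sender.close()
```

In a real program the sender would normally live in another process, and the receiver could do other work between polls instead of blocking in recv().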

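Example code: a producer-consumer with a pipe

The following example puts these concepts into practice. Its structure is deliberately close to the producer-consumer example built with a queue, so that the two can be compared easily: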

from multiprocessing import Process, Pipe
import time
import random

class Consumer(Process):
    def __init__(self, count, conn):
        Process.__init__(self)
        self.count = count
        self.conn = conn

    def run(self):
        for i in range(self.count):
            local = self.conn.recv()
            time.sleep(2)
            print("consumer has used this: %s" % local)

class Producer(Process):
    def __init__(self, count, conn):
        Process.__init__(self)
        self.count = count
        self.conn = conn

    def request(self):
        time.sleep(1)
        return random.randint(0, 100)

    def run(self):
        for i in range(self.count):
            local = self.request()
            self.conn.send(local)
            print("producer has loaded this: %s" % local)

if __name__ == '__main__':
    recver, sender = Pipe()
    count = 5
    p1 = Producer(count, sender)
    p2 = Consumer(count, recver)
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    recver.close()
    sender.close()

Sample output:

producer has loaded this: 0
producer has loaded this: 76
consumer has used this: 0
producer has loaded this: 28
producer has loaded this: 90
consumer has used this: 76
producer has loaded this: 75
consumer has used this: 28
consumer has used this: 90
consumer has used this: 75

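As you can see, the data generated by the producer is sent through the pipe item by item, and the consumer receives and uses it in order.

Closing the pipe

When you have finished with a Pipe, call the close() method of both connections to release their resources: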

recver.close()
sender.close()

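Limitations of pipes

The biggest limitation of Pipe is that it works between two endpoints only, that is, it connects just two processes. If more processes need to communicate, a pipe has to be set up for every pair of them. The more processes there are, the more complex the management of the pipes becomes, and the code can get hard to maintain.

Pipe versus Queue

We have just seen that the multiprocessing module offers two solutions for communication between processes, Pipe and Queue. Which one should you choose when exchanging data between parallel processes?

At first glance, since a Pipe can only be used between two processes, Queue might seem to be always the better choice. That is not the case.

Pipe is a simpler, lower-level class, which makes it more efficient and faster at transferring data between two processes. So if the processes communicate in pairs, Pipe is preferable to Queue. If the data exchange involves many processes, only Queue is practical.

Parallel mapping of functions with a process pool

Another important aspect of parallel programming is the mapping of functions. Python's built-in map() function applies a given function to every element of an iterable. In Python 3 it returns a lazy iterator, so each result is computed only when it is requested.

For example: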

import time
import math
import numpy as np

def func(value):
    result = math.sqrt(value)
    print("The value %s and the elaboration is %s" % (value, result))
    return result

if __name__ == '__main__':
    data = np.array([10, 3, 6, 1])
    results = map(func, data)
    for result in results:
        print("This is the result: %s" % result)

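Result (the two kinds of lines alternate because map() computes each value only when the loop asks for it):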

The value 10 and the elaboration is 3.1622776601683795
This is the result: 3.1622776601683795
The value 3 and the elaboration is 1.7320508075688772
This is the result: 1.7320508075688772
The value 6 and the elaboration is 2.449489742783178
This is the result: 2.449489742783178
The value 1 and the elaboration is 1.0
This is the result: 1.0

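This mapping mechanism can be extended to parallel programming. If the function can be evaluated on the elements in parallel, efficiency improves considerably. Suppose one call of the function takes time t: an ordinary program processing n elements one after another takes n * t, whereas a parallel program using n processes at once takes roughly t, provided the system has at least n cores available.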
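The estimate can be written down as a tiny calculation. This is an idealized model only: `sequential_time` and `parallel_time` are hypothetical helper names, every call is assumed to take exactly t, and process start-up and communication costs are ignored. When there are fewer workers than elements, the elements are handled in successive rounds.

```python
import math

def sequential_time(n_items, t):
    """Time to process n_items one after another, each taking t."""
    return n_items * t

def parallel_time(n_items, t, n_workers):
    """Idealized time with n_workers running at once (no overhead modeled)."""
    # Items are processed in rounds of at most n_workers items each
    return math.ceil(n_items / n_workers) * t

if __name__ == "__main__":
    print(sequential_time(4, 10))     # 4 tasks of 10 s, one after another
    print(parallel_time(4, 10, 4))    # the same tasks on 4 workers
    print(parallel_time(12, 2, 4))    # 12 tasks of 2 s on 4 workers: 3 rounds
```

Real measurements will be somewhat higher than this model predicts, because creating processes and moving data between them takes time.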

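map() and map_async() of multiprocessing.pool.Pool

The Pool class of the multiprocessing module provides two methods for parallel mapping:

map(): synchronous and blocking; it returns the list of results only after all tasks have completed;
map_async(): asynchronous and non-blocking; it returns an AsyncResult object immediately, and the main process continues without waiting for the results.

An example with map():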

import time
import math
import numpy as np
from multiprocessing.pool import Pool

def func(value):
    result = math.sqrt(value)
    print("The value %s and the elaboration is %s" % (value, result))
    time.sleep(value)
    return result

if __name__ == '__main__':
    with Pool() as pool:
        data = np.array([10, 3, 6, 1])
        results = pool.map(func, data)
        print("The main process is going on…")
        for result in results:
            print("This is the result: %s" % result)
    print("END Program")

Sample run:

The value 10 and the elaboration is 3.1622776601683795
The value 3 and the elaboration is 1.7320508075688772
The value 6 and the elaboration is 2.449489742783178
The value 1 and the elaboration is 1.0
The main process is going on…
This is the result: 3.1622776601683795
This is the result: 1.7320508075688772
This is the result: 2.449489742783178
This is the result: 1.0
END Program

For the map_async() version, simply replace map() with map_async() and call results.get() to retrieve the results:

import time
import math
import numpy as np
from multiprocessing.pool import Pool

def func(value):
    result = math.sqrt(value)
    print("The value %s and the elaboration is %s" % (value, result))
    time.sleep(value)
    return result

if __name__ == '__main__':
    with Pool() as pool:
        data = np.array([10, 3, 6, 1])
        results = pool.map_async(func, data)
        print("Main Process is going on…")
        for result in results.get():
            print("This is the result: %s" % result)
    print("END Program")

Sample run:

Main Process is going on…
The value 10 and the elaboration is 3.1622776601683795
The value 3 and the elaboration is 1.7320508075688772
The value 6 and the elaboration is 2.449489742783178
The value 1 and the elaboration is 1.0
This is the result: 3.1622776601683795
This is the result: 1.7320508075688772
This is the result: 2.449489742783178
This is the result: 1.0
END Program

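Comparing the results

With map(), the main process waits for all the worker processes to finish; the program flow is blocked until every result has been returned. Only when all the parallel tasks are complete is "The main process is going on…" printed.
With map_async(), the main process does not wait for the tasks and continues immediately, so "Main Process is going on…" is printed first. It is the later call to results.get() that blocks until the results are ready.

The chunksize parameter in parallel mapping

map() applies the given function to every element of the iterable. If the iterable has many elements, handing them to the worker processes one at a time is inefficient, because every hand-off carries communication overhead.

A more efficient approach is to divide the iterable into chunks, each containing a certain number of elements, and to hand each chunk to a worker as a single task. This reduces the overhead of dispatching and managing the work.

This is done by passing the chunksize parameter to map(). When chunksize is omitted, Pool.map() chooses a chunk size itself.

Example code

We modify the previous example by increasing the number of elements in the input array and setting chunksize to 4, so that each task handed to a worker contains 4 elements: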

import time
import math
import numpy as np
from multiprocessing.pool import Pool

def func(value):
    result = math.sqrt(value)
    print("The value %s and the elaboration is %s" % (value, result))
    time.sleep(value)
    return result

if __name__ == '__main__':
    with Pool() as pool:
        data = np.array([10, 3, 6, 1, 4, 5, 2, 9, 7, 3, 4, 6])
        results = pool.map(func, data, chunksize=4)
        print("The main process is going on…")
        for result in results:
            print("This is the result: %s" % result)
    print("END Program")

Sample output:

The value 10 and the elaboration is 3.1622776601683795
The value 4 and the elaboration is 2.0
The value 7 and the elaboration is 2.6457513110645907
The value 5 and the elaboration is 2.23606797749979
The value 3 and the elaboration is 1.7320508075688772
The value 2 and the elaboration is 1.4142135623730951
The value 3 and the elaboration is 1.7320508075688772
The value 4 and the elaboration is 2.0
The value 9 and the elaboration is 3.0
The value 6 and the elaboration is 2.449489742783178
The value 6 and the elaboration is 2.449489742783178
The value 1 and the elaboration is 1.0
The main process is going on…
This is the result: 3.1622776601683795
This is the result: 1.7320508075688772
This is the result: 2.449489742783178
This is the result: 1.0
This is the result: 2.0
This is the result: 2.23606797749979
This is the result: 1.4142135623730951
This is the result: 3.0
This is the result: 2.6457513110645907
This is the result: 1.7320508075688772
This is the result: 2.0
This is the result: 2.449489742783178
END Program
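The order of the first lines in this output (10, 4, and 7 appear before the other values) is consistent with the input being cut into three consecutive chunks of four elements, each started by a different worker. The sketch below only illustrates that grouping; `split_into_chunks` is a hypothetical helper, not the code Pool uses internally.

```python
def split_into_chunks(data, chunksize):
    """Group consecutive elements of a list into chunks of at most chunksize items."""
    return [data[i:i + chunksize] for i in range(0, len(data), chunksize)]

if __name__ == "__main__":
    data = [10, 3, 6, 1, 4, 5, 2, 9, 7, 3, 4, 6]
    for number, chunk in enumerate(split_into_chunks(data, 4)):
        # Each chunk would be handed to one worker as a single task
        print("chunk %d: %s" % (number, chunk))
```

Within a chunk the elements are processed one after another by the same worker, which is why a chunk containing a long-running element delays the other elements of that chunk.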

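Other methods

Besides map(), the Pool class offers other methods that behave similarly. apply() and starmap() have asynchronous counterparts, apply_async() and starmap_async(), while imap() and imap_unordered() return the results lazily through an iterator. Together they cover a range of different needs.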
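A brief sketch of three of these methods follows. The functions `power` and `square` are hypothetical stand-ins for real work; the Pool methods themselves (`starmap`, `imap_unordered`, `apply_async`) are part of the documented multiprocessing API.

```python
import multiprocessing

def power(base, exponent):
    """Function of two arguments, suitable for starmap()."""
    return base ** exponent

def square(x):
    """Function of one argument, suitable for imap_unordered()."""
    return x * x

if __name__ == "__main__":
    with multiprocessing.Pool(processes=2) as pool:
        # starmap() unpacks each tuple into the arguments of the function
        print(pool.starmap(power, [(2, 3), (3, 2), (10, 0)]))
        # imap_unordered() yields results as they finish, in no fixed order,
        # so they are sorted here only to make the printout stable
        print(sorted(pool.imap_unordered(square, range(5))))
        # apply_async() submits a single call and returns a handle at once
        handle = pool.apply_async(power, (2, 10))
        print(handle.get(timeout=10))
```

Which method fits best depends on whether the function takes one argument or several, and on whether the results are needed in input order.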

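ProcessPoolExecutor

We can also use another class, ProcessPoolExecutor, from the concurrent.futures module, which offers a mechanism similar to a process pool. This class also maintains a pool of reusable processes for carrying out operations, including a map() method.

To use it, you specify the target function to execute and the iterable to map over. When all tasks are done, the ProcessPoolExecutor must be shut down to release its resources by calling shutdown().

However, ProcessPoolExecutor supports the context manager protocol, so the recommended way to manage it is with a with statement:

with ProcessPoolExecutor() as executor:
    pass  # run tasks

When execution leaves the with block, shutdown() is called automatically, which is roughly equivalent to:

executor = ProcessPoolExecutor()
try:
    pass  # run tasks
finally:
    executor.shutdown()

This makes resource management very convenient.

Usage example

The following code shows how to call map() in parallel with ProcessPoolExecutor: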

import time
import math
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def func(value):
    result = math.sqrt(value)
    print("The value %s and the elaboration is %s" % (value, result))
    time.sleep(value)
    return result

if __name__ == '__main__':
    with ProcessPoolExecutor(10) as executor:
        data = np.array([10, 3, 6, 1])
        for result in executor.map(func, data):
            print("This is the result: %s" % result)
    print("END Program")

Sample run:

The value 10 and the elaboration is 3.1622776601683795
The value 3 and the elaboration is 1.7320508075688772
The value 6 and the elaboration is 2.449489742783178
The value 1 and the elaboration is 1.0
This is the result: 3.1622776601683795
This is the result: 1.7320508075688772
This is the result: 2.449489742783178
This is the result: 1.0
END Program

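Example with the chunksize parameter

Like the map() of Pool, the map() of ProcessPoolExecutor supports a chunksize parameter, which specifies how many elements are grouped into each task handed to a process.

The following example demonstrates this. To show which process handles each element, it imports the os module and uses getpid() to obtain the ID of the current process: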

import time
import math
import os
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def func(value):
    result = math.sqrt(value)
    pid = os.getpid()
    print("[Pid:%s] The value %s and the elaboration is %s" % (pid, value, result))
    time.sleep(value)
    return result

if __name__ == '__main__':
    with ProcessPoolExecutor(10) as executor:
        data = np.array([10, 3, 6, 1, 4, 5, 2, 9, 7, 3, 4, 6])
        for result in executor.map(func, data, chunksize=4):
            print("This is the result: %s" % result)
    print("END Program")

Sample run:

[Pid:14716] The value 10 and the elaboration is 3.1622776601683795
[Pid:22496] The value 4 and the elaboration is 2.0
[Pid:6508] The value 7 and the elaboration is 2.6457513110645907
[Pid:22496] The value 5 and the elaboration is 2.23606797749979
[Pid:6508] The value 3 and the elaboration is 1.7320508075688772
[Pid:22496] The value 2 and the elaboration is 1.4142135623730951
[Pid:14716] The value 3 and the elaboration is 1.7320508075688772
[Pid:6508] The value 4 and the elaboration is 2.0
[Pid:22496] The value 9 and the elaboration is 3.0
[Pid:14716] The value 6 and the elaboration is 2.449489742783178
[Pid:6508] The value 6 and the elaboration is 2.449489742783178
[Pid:14716] The value 1 and the elaboration is 1.0
This is the result: 3.1622776601683795
This is the result: 1.7320508075688772
This is the result: 2.449489742783178
This is the result: 1.0
This is the result: 2.0
This is the result: 2.23606797749979
This is the result: 1.4142135623730951
This is the result: 3.0
This is the result: 2.6457513110645907
This is the result: 1.7320508075688772
This is the result: 2.0
This is the result: 2.449489742783178
END Program

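The mpi4py library

There is a completely different approach to parallel programming with processes in Python: the Message Passing Interface (MPI). MPI is a communication protocol that has become the de facto standard for parallel programming on multi-node (multiprocessor) systems. The protocol has been implemented for several programming languages, such as Fortran and C.

Several MPI modules exist for Python, the most representative of which is the mpi4py library. It was developed on the basis of the MPI-1/2 specifications and provides an object-oriented interface modeled on the MPI-2 C++ bindings.

The design of mpi4py is quite different from that of the multiprocessing module of the standard library used so far. It builds on a standard developed specifically for multiprocessor machines, distributed architectures, and clusters, so both the structure of the code and the way it is executed differ considerably.

Why use MPI instead of multiprocessing?

MPI is the de facto standard for parallel programming in many languages, Python included;
MPI adapts to a wide range of architectures, from a single multicore machine to complex distributed systems and clusters;
on a single multicore machine the gain in efficiency may not be noticeable, but MPI is well suited to developing parallel programs that can scale to complex architectures.

Installing and running

Installing mpi4py:

With Anaconda: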
```
conda install mpi4py
```
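With pip (an MPI implementation such as MPICH, Open MPI, or Microsoft MPI needs to be available on the system):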
```
pip install mpi4py
```

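To run a program, do not call the python command directly; use mpiexec to launch several processes:

mpiexec -n x python mpi4py_name.py

Here x is the number of parallel processes to start.

A simple example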

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.rank

print("The process %d is started" % rank)

Assuming the file is saved as mpi01.py, start two processes:

mpiexec -n 2 python mpi01.py

Output:

The process 0 is started
The process 1 is started

Start six processes:

mpiexec -n 6 python mpi01.py

Sample output:

The process 2 is started
The process 5 is started
The process 1 is started
The process 4 is started
The process 3 is started
The process 0 is started

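Notes

The order in which the process numbers (ranks) appear in the output is not fixed and may differ from run to run;
comm is the communicator, which defines a group of processes that can communicate with one another through message passing: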
```
comm = MPI.COMM_WORLD
```
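rank is the number of the current process within the communicator's group. All the processes launched by mpiexec belong to this group, and each has a unique rank that identifies it.

Performance of processes executing in parallel

As we saw with the multiprocessing module, processes allow code to execute in parallel. We can run a similar test with the mpi4py library and benchmark it by measuring the execution time.

Here is the test code: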

import time
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.rank

print("The process %d is started" % rank)
time.sleep(10)
print("The process %d is ended" % rank)

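Running and timing

Because the program is launched by mpiexec, timing it from inside with Python's time module would measure each process separately rather than the run as a whole. Instead we can use the time command on Linux, or the Measure-Command cmdlet in Windows PowerShell.

Example (PowerShell):

Measure-Command { mpiexec -n 4 python mpi01b.py | Out-Host }

Sample output: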

The process 2 is started
The process 2 is ended
The process 3 is started
The process 3 is ended
The process 1 is started
The process 1 is ended
The process 0 is started
The process 0 is ended
Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 10
Milliseconds      : 298
Ticks             : 102985516
TotalDays         : 0,000119196199074074
TotalHours        : 0,00286070877777778
TotalMinutes      : 0,171642526666667
TotalSeconds      : 10,2985516
TotalMilliseconds : 10298,5516

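The four parallel processes took about 10 seconds in total, as expected: each process sleeps for 10 seconds and they all run at the same time. The start and end lines of each process appear in pairs, most likely because the output of each process is buffered and written out only when the process exits.

Comparison with sequential execution

The sequential version of the code: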

import time

for i in range(4):
    print("The process %d is started" % i)
    time.sleep(10)
    print("The process %d is ended" % i)

Timing it with PowerShell:

Measure-Command { python mpi01c.py }

Sample output:

Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 40
Milliseconds      : 99
Ticks             : 400994538
TotalDays         : 0,000464114048611111
TotalHours        : 0,0111387371666667
TotalMinutes      : 0,66832423
TotalSeconds      : 40,0994538
TotalMilliseconds : 40099,4538

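Sequential execution takes about 40 seconds (four 10-second tasks one after another), four times as long as the roughly 10 seconds of the parallel run.

Summary

This simple experiment shows directly how much running processes in parallel can improve efficiency. The advantage is most evident when the tasks are time-consuming and can be carried out independently of one another.

Parallel efficiency as a function of the number of processors/cores

To study how parallel efficiency depends on the number of processors or cores, we can take timings while varying the number of parallel processes. Here is the test code: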

import time
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.rank

print("The process %d is started" % rank)
n = 0
for i in range(100000000):
    n += i
print("The process %d is ended" % rank)

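We run this code with mpiexec, varying the number of parallel processes and recording the execution time. The results are shown in Table 3.1:

Processes N	Total execution time t (s)	Average time per process s = t / N (s)
1	10.7	10.7
2	13.8	6.9
4	16.0	4.0
6	19.9	3.317
8	27.6	3.45
12	39.6	3.3

The key quantity we are interested in is the ratio

$s = \frac{t}{N}$

that is, the average time each process takes to carry out its task.

The table shows that s decreases as the number of processes grows, but it approaches a threshold (about 3.3 seconds) below which it does not fall.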
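The third column can be recomputed from the first two. The short sketch below does exactly that with the measurements of Table 3.1; `per_process_time` is a hypothetical helper name, and the numbers are copied from the table rather than measured again.

```python
# Total execution times t (seconds) from Table 3.1, keyed by process count N
measurements = {1: 10.7, 2: 13.8, 4: 16.0, 6: 19.9, 8: 27.6, 12: 39.6}

def per_process_time(total_seconds, n_processes):
    """Average time per process, s = t / N."""
    return total_seconds / n_processes

if __name__ == "__main__":
    for n, t in measurements.items():
        print("N=%2d  t=%5.1f s  s=%.3f s" % (n, t, per_process_time(t, n)))
```

Printing the values side by side makes it easy to see where s stops decreasing.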

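This trend is even clearer in Figure 3.3:

(Figure 3.3: s plotted against N; s levels off as N increases.)

Notes

This behaviour reflects the bottleneck that parallel efficiency runs into as the number of processes grows;
besides the speed-up from the parallel computation itself, there are overheads such as communication and contention for resources that limit further gains;
in practice, the degree of parallelism should be chosen according to the hardware and the characteristics of the task.

From Figure 3.3 we can see the following:

When the number of processes exceeds 6, the execution time per task remains almost constant. This is because my system has 6 cores: beyond that number, the processes can no longer all run truly in parallel.

Main applications of mpi4py

The mpi4py library is based on the message passing mechanism of the MPI standard. Its applications in parallel programming fall into three categories:

Point-to-point communication
Collective communication
Topologies

The rest of this chapter introduces these three applications through a series of examples.

Implementing point-to-point communication

Point-to-point communication is the exchange of messages between two processes. Ideally, every send operation would be perfectly synchronized with the corresponding receive operation, but in practice this is not the case. The MPI implementation must guarantee that the data sent is not lost even when the sending and receiving processes are not synchronized. This is usually achieved with a buffer, which is transparent to the developer and managed entirely by the mpi4py library.

The mpi4py module supports point-to-point communication through the following two functions:

comm.send(data, dest=receiver_process): sends data to the destination process (identified by its rank)
comm.recv(source=sender_process): receives data from the given process (identified by its rank)

Example code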

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.rank

print("Process %s started" % rank)

if rank == 0:
    msg = "This is my message"
    receiver = 1
    comm.send(msg, dest=receiver)
    print("Process 0 sent: %s to %d" % (msg, receiver))

if rank == 1:
    source = 0
    msg = comm.recv(source=source)
    print("Process 1 received: %s from %d" % (msg, source))

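Running the code

Run it in parallel with the following command. Only processes 0 and 1 exchange a message; any other processes started simply print their start-up line:

mpiexec -n 9 python programma01.py

Note

The send() and recv() methods are blocking: the calling process is held until the data buffer can safely be reused (for send) or the data has arrived (for recv).

Collective communications

The commonly used collective communication operations are:

Barrier synchronization
Broadcasting data
Gathering data
Scattering data
Reduction operation

Collective communication with data broadcasting

In parallel programming we often need to share the value of a variable among several processes, or to assign different values to the processes and have each compute a partial result. Such problems are usually handled with a tree-like communication structure (for example, process 0 sends the data to processes 1 and 2, which in turn send it to processes 3, 4, 5, and 6).

Communication that involves all the processes of a communicator's group is called collective communication. In other words, a collective communication involves the whole group rather than just one sender and one receiver.

Broadcast is a common form of collective communication in which one process sends the same data to all the other processes in the group (see Figure 3.4).

In the following code, the root process (process 0) obtains the data and broadcasts it to all processes, the root included: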

from mpi4py import MPI
import random

comm = MPI.COMM_WORLD
rank = comm.rank

if rank == 0:
    data = random.randint(1, 10)
else:
    data = None

data = comm.bcast(data, root=0)

if rank == 1:
    print("The square of %d is %d" % (data, data * data))
if rank == 2:
    print("Half of %d is %g" % (data, data / 2))  # %g keeps the .5 for odd values
if rank == 3:
    print("Double of %d is %d" % (data, data * 2))

Running the code gives a result similar to the following:

$ mpiexec -n 4 python mpi04a.py
The square of 6 is 36
Half of 6 is 3
Double of 6 is 12

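Collective communication with data scattering

Scatter works much like broadcast, with one difference:
comm.bcast() sends the same data to all processes, whereas comm.scatter() distributes different parts of an array to different processes (see Figure 3.5).

Scattering data with comm.scatter()

comm.scatter() distributes the elements of the array to the processes according to their rank: the first element goes to process 0, the second to process 1, and so on. The sequence supplied by the root must therefore contain exactly one element per process.

Example code: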

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.rank

if rank == 0:
    array = ['AAA', 'BBB', 'CCC', 'DDD']
else:
    array = None

data = comm.scatter(array, root=0)
print("Process %d is working on %s element" % (rank, data))

Sample output:

$ mpiexec -n 4 python mpi05a.py
Process 0 is working on AAA element
Process 1 is working on BBB element
Process 2 is working on CCC element
Process 3 is working on DDD element
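comm.scatter() expects the root to supply one item per process, so a longer list has to be grouped first. The sketch below shows one possible way to do that: `partition` is a hypothetical helper that cuts a list into as many consecutive sublists as there are processes, and each process then receives one sublist. The fallback branch, which only prints the partition, is an assumption added so that the script also runs where mpi4py is not installed.

```python
def partition(items, size):
    """Split a list into `size` consecutive sublists of nearly equal length."""
    base, extra = divmod(len(items), size)
    parts, start = [], 0
    for r in range(size):
        stop = start + base + (1 if r < extra else 0)  # first `extra` parts get one more
        parts.append(items[start:stop])
        start = stop
    return parts

def main():
    from mpi4py import MPI  # needs mpi4py and an MPI runtime
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    # Only the root prepares the data; the other ranks pass None
    chunks = partition(list(range(10)), size) if rank == 0 else None
    local = comm.scatter(chunks, root=0)
    print("Process %d received %s" % (rank, local))

if __name__ == "__main__":
    try:
        main()
    except ImportError:
        print("mpi4py is not installed; partition for 4 processes:")
        print(partition(list(range(10)), 4))
```

Launched with mpiexec, every process receives its own sublist; the sublists differ in length by at most one element.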

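Collective communication with data gathering

Gathering is the inverse of scattering. In a gather operation, all the processes taking part in the computation send their own portion of the data to a designated root process, which collects it all (see Figure 3.6).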

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.rank

if rank == 0:
    data = 'AAA'
if rank == 1:
    data = 'BBB'
if rank == 2:
    data = 'CCC'
if rank == 3:
    data = 'DDD'

array = comm.gather(data, root=0)

if rank == 0:
    print("The new array is %s " % array)

Sample output:

$ mpiexec -n 4 python mpi06a.py
The new array is ['AAA', 'BBB', 'CCC', 'DDD']

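This style of message passing is well suited to parallel computation. Many algorithms can be split into a number of identical small tasks that are assigned to different processes and handled in parallel, which shortens the execution time. At the end of the computation, the partial results are gathered to form the final result.

Gather example: a parallel sum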

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.rank
size = comm.size

data = 2 * rank + 1
print("Process %d calculated this value: %d" % (rank, data))

array = comm.gather(data, root=0)

if rank == 0:
    result = 0
    for i in range(size):
        result += array[i]
    print("The result of the parallel computation is %d" % result)

Sample run with four processes:

$ mpiexec -n 4 python mpi06b.py
Process 3 calculated this value: 7
Process 1 calculated this value: 3
Process 2 calculated this value: 5
Process 0 calculated this value: 1
The result of the parallel computation is 16

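As you can see, the computation is divided into four equal parts, and each process computes its own part. At the end, the root process adds up all the results to obtain the final one.

Scaling up the number of processes

Increasing the number of parallel processes to 8 gives the following result: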

$ mpiexec -n 8 python mpi06b.py
Process 5 calculated this value: 11
Process 7 calculated this value: 15
Process 3 calculated this value: 7
Process 6 calculated this value: 13
Process 1 calculated this value: 3
Process 2 calculated this value: 5
Process 4 calculated this value: 9
Process 0 calculated this value: 1
The result of the parallel computation is 64
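
The totals in the two runs follow a simple pattern. Process r contributes the value 2r + 1, so with n processes the gathered values are the first n odd numbers, and their sum is

$\sum_{r=0}^{n-1} (2r + 1) = 2 \cdot \frac{(n-1)n}{2} + n = n^2$

which gives 16 for n = 4 and 64 for n = 8, matching the printed results.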

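Collective communication with the AlltoAll pattern

Another form of collective communication is the AlltoAll pattern. Here the processes behave as if they were gathering and scattering at the same time: each process sends a block of data to, and receives a block of data from, every process in the communicator's group (see Figure 3.7).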

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.rank

if rank == 0:
    output = ['0A', '0B', '0C', '0D']
if rank == 1:
    output = ['1A', '1B', '1C', '1D']
if rank == 2:
    output = ['2A', '2B', '2C', '2D']
if rank == 3:
    output = ['3A', '3B', '3C', '3D']

# "received" avoids shadowing Python's built-in input()
received = comm.alltoall(output)
print("Process %s received %s" % (rank, received))

Running the code gives a result like this:

$ mpiexec -n 4 python mpi07a.py
Process 0 received ['0A', '1A', '2A', '3A']
Process 2 received ['0C', '1C', '2C', '3C']
Process 3 received ['0D', '1D', '2D', '3D']
Process 1 received ['0B', '1B', '2B', '3B']

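Explanation of the result

As the output shows, alltoall performs a complete exchange among the processes:

each process sends element j of its list to process j;
at the same time, each process receives one element from every process i and stores it at position i of its new list.

For example, process 0 receives a new list made up of the first element of every process's list: ['0A', '1A', '2A', '3A'].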
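Seen from outside, this exchange is a transposition of the table whose rows are the send lists. The following single-process model reproduces it without MPI; `simulate_alltoall` is a hypothetical function written for illustration and does not use mpi4py.

```python
def simulate_alltoall(send_buffers):
    """Model alltoall: row i is what process i sends, result row j is what j receives."""
    size = len(send_buffers)
    # Process dst receives element dst from every source process, in rank order
    return [[send_buffers[src][dst] for src in range(size)] for dst in range(size)]

if __name__ == "__main__":
    sent = [
        ['0A', '0B', '0C', '0D'],
        ['1A', '1B', '1C', '1D'],
        ['2A', '2B', '2C', '2D'],
        ['3A', '3B', '3C', '3D'],
    ]
    for rank, received in enumerate(simulate_alltoall(sent)):
        print("Process %s received %s" % (rank, received))
```

The model prints the processes in rank order, whereas in a real MPI run the lines can appear in any order.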

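Reduction operation

A reduction is performed by comm.reduce(). It takes a set of input elements from every process, combines them with the chosen operation, and returns the result to the root process. The output elements are the result of the reduction.

Example code: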

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.rank

if rank == 0:
    output = np.array([0, 0, 0, 0])
if rank == 1:
    output = np.array([1, 1, 1, 1])
if rank == 2:
    output = np.array([2, 2, 2, 2])
if rank == 3:
    output = np.array([3, 3, 3, 3])

print("Process %d. Sending %s" % (rank, output))

# "result" avoids shadowing Python's built-in input(); it is None on non-root ranks
result = comm.reduce(output, root=0, op=MPI.SUM)

if rank == 0:
    print("The result of the parallel computation is %s" % (result))

Sample output:

$ mpiexec -n 4 python mpi08a.py
Process 3. Sending [3 3 3 3]
Process 2. Sending [2 2 2 2]
Process 1. Sending [1 1 1 1]
Process 0. Sending [0 0 0 0]
The result of the parallel computation is [6 6 6 6]

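Available reduction operations

Operation	Description
SUM	Sum of the elements
PROD	Product of the elements
MAX	Maximum of the elements
MAXLOC	Maximum of the elements and the rank of the process holding it
MIN	Minimum of the elements
MINLOC	Minimum of the elements and the rank of the process holding it
LAND	Logical AND of the elements
LOR	Logical OR of the elements
BAND	Bitwise AND of the elements
BOR	Bitwise OR of the elements

Table 3.2: Commonly used reduction operations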
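What a reduction computes can be modeled in a single process with functools.reduce. In the sketch below, `simulate_reduce` is a hypothetical helper that combines the per-process lists position by position; ordinary Python operators stand in for the MPI operations (operator.add for SUM, max for MAX), which is an assumption of this model rather than how mpi4py is implemented.

```python
from functools import reduce
import operator

def simulate_reduce(per_process_values, op=operator.add):
    """Combine equally long lists, one per process, element by element with op."""
    # zip(*...) groups the values that sit at the same position in every list
    return [reduce(op, column) for column in zip(*per_process_values)]

if __name__ == "__main__":
    contributions = [[0, 0, 0, 0], [1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]]
    print("SUM:", simulate_reduce(contributions))               # models op=MPI.SUM
    print("MAX:", simulate_reduce(contributions, op=max))       # models op=MPI.MAX
    print("PROD:", simulate_reduce(contributions, op=operator.mul))
```

With the four contributions used in the example above, the SUM line agrees with the [6 6 6 6] reported by the MPI run.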

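Optimizing communication with topologies

MPI offers an interesting feature called the virtual topology. All communication between processes takes place within a communicator. With MPI_COMM_WORLD (MPI.COMM_WORLD in mpi4py), all processes belong to the same group and each has a unique number (rank) from 0 to n-1, where n is the number of processes started in parallel.

mpi4py lets us define a virtual topology for this communicator, that is, redefine how ranks are assigned. Once a virtual topology is defined, each node (process) communicates only with its neighbors in the topology, which shortens the communication paths and improves efficiency.

In complex scenarios ranks are often assigned at random, and a message may be forwarded through several processes before it reaches its destination, which increases the execution time. A virtual topology addresses this problem.

The two kinds of topology supported by MPI

Cartesian topology
Adjacency matrix (graph) topology

Cartesian topologies are commonly used to build structures such as rings and toroids.

Cartesian topology example

The following code creates a three-dimensional 3×3×3 Cartesian topology: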

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.rank

comm_3D = comm.Create_cart(dims=[3, 3, 3],
                           periods=[False, False, False],
                           reorder=False)

xyz = comm_3D.Get_coords(rank)
print("In this 3D topology, process %s has coordinates %s" % (rank, xyz))

Sample run and result

Suppose we run 27 processes:

$ mpiexec -n 27 python mpi09.py

Sample output:

In this 3D topology, process 0 has coordinates [0, 0, 0]
In this 3D topology, process 1 has coordinates [0, 0, 1]
In this 3D topology, process 4 has coordinates [0, 1, 1]
In this 3D topology, process 6 has coordinates [0, 2, 0]
In this 3D topology, process 7 has coordinates [0, 2, 1]
In this 3D topology, process 8 has coordinates [0, 2, 2]
In this 3D topology, process 12 has coordinates [1, 1, 0]
In this 3D topology, process 10 has coordinates [1, 0, 1]
In this 3D topology, process 9 has coordinates [1, 0, 0]
In this 3D topology, process 11 has coordinates [1, 0, 2]
In this 3D topology, process 13 has coordinates [1, 1, 1]
In this 3D topology, process 14 has coordinates [1, 1, 2]
In this 3D topology, process 15 has coordinates [1, 2, 0]
In this 3D topology, process 16 has coordinates [1, 2, 1]
In this 3D topology, process 17 has coordinates [1, 2, 2]
In this 3D topology, process 20 has coordinates [2, 0, 2]
In this 3D topology, process 5 has coordinates [0, 1, 2]
In this 3D topology, process 2 has coordinates [0, 0, 2]
In this 3D topology, process 18 has coordinates [2, 0, 0]
In this 3D topology, process 24 has coordinates [2, 2, 0]
In this 3D topology, process 26 has coordinates [2, 2, 2]
In this 3D topology, process 21 has coordinates [2, 1, 0]
In this 3D topology, process 3 has coordinates [0, 1, 0]
In this 3D topology, process 22 has coordinates [2, 1, 1]
In this 3D topology, process 19 has coordinates [2, 0, 1]
In this 3D topology, process 25 has coordinates [2, 2, 1]
In this 3D topology, process 23 has coordinates [2, 1, 2]

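Notes

This result can be used to draw the process topology (see Figure 3.9): placing the processes at their three-dimensional coordinates makes the communication paths easier to understand and optimize.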
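The coordinates in this output follow the row-major ordering that the MPI standard prescribes for Cartesian topologies: the last coordinate changes fastest. The sketch below recomputes them in plain Python; `cart_coords` is a hypothetical helper that models what `Get_coords` reports for a communicator created with `reorder=False`, and it does not call mpi4py.

```python
def cart_coords(rank, dims):
    """Row-major coordinates of `rank` in a grid with the given extents."""
    coords = []
    for extent in reversed(dims):           # peel off the fastest-changing axis first
        rank, position = divmod(rank, extent)
        coords.append(position)
    return coords[::-1]                     # restore the order used by dims

if __name__ == "__main__":
    dims = [3, 3, 3]
    for rank in (0, 1, 5, 12, 23, 26):
        print("process %s has coordinates %s" % (rank, cart_coords(rank, dims)))
```

Comparing a few ranks with the listing above is a quick way to check one's understanding of the numbering.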

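Finding neighboring processes and communicating with them directly

For a given process, the neighbors in the Cartesian topology are obtained with the following method of the Cartesian communicator:

comm_3D.Shift(direction, disp)

It returns a pair of ranks (source, dest) for the given direction (Cartesian axis): dest is the process reached by moving disp steps along the axis, and source is the process from which the calling process would be reached by the same move. The parameters are:

direction: the direction to query, that is, an axis of the Cartesian grid, in the range [0, n-1] where n is the number of dimensions of the grid. The index refers to the position in the coordinate list, so direction 0 changes the first coordinate. In the naming used in this example, 0 is called the Z axis, 1 the X axis, and 2 the Y axis.
disp: the displacement (an integer); > 0 moves in the positive direction, < 0 in the negative direction. To query the immediate neighbors, use 1.

Code example

Continuing with the 3×3×3 Cartesian topology built earlier, we query the neighbors of process 12: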

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.rank

comm_3D = comm.Create_cart(dims=[3, 3, 3],
                           periods=[False, False, False],
                           reorder=False)

xyz = comm_3D.Get_coords(rank)

if rank == 12:
    print("In this 3D topology, process %s has coordinates %s" % (rank, xyz))

    right, left = comm_3D.Shift(0, 1)       # along the Z axis
    up, down = comm_3D.Shift(1, 1)          # along the X axis
    forward, backward = comm_3D.Shift(2, 1) # along the Y axis

    print("Neighbors (left-right): %s %s" % (left, right))
    print("Neighbors (up-down): %s %s" % (up, down))
    print("Neighbors (forward-backward): %s %s" % (forward, backward))

Sample output:

$ mpiexec -n 27 python mpi09b.py
In this 3D topology, process 12 has coordinates [1, 1, 0]
Neighbors (left-right): 21 3
Neighbors (up-down): 9 15
Neighbors (forward-backward): -1 13

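Notes and applications

The output gives the ranks of the processes adjacent to process 12 along each of the three axes. The value -1 is MPI.PROC_NULL on this system and means that there is no neighbor in that direction (process 12 lies on the boundary of the grid); the numeric value of MPI.PROC_NULL depends on the MPI implementation.
When a parallel computation only needs to exchange information with neighboring processes, these ranks can be used directly, avoiding global collective communication.
Conditions can be used to select which neighbors to communicate with, which keeps the exchange flexible and efficient.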
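The neighbor ranks reported for process 12 can be reproduced with a small model of a non-periodic grid. In this sketch `cart_rank` and `cart_shift` are hypothetical helpers, not mpi4py calls: the first turns coordinates into a row-major rank, the second returns the (source, dest) pair for one axis and uses None where MPI would report MPI.PROC_NULL.

```python
def cart_rank(coords, dims):
    """Row-major rank of a coordinate list in a grid with the given extents."""
    rank = 0
    for position, extent in zip(coords, dims):
        rank = rank * extent + position
    return rank

def cart_shift(coords, dims, direction, disp):
    """Return (source, dest) ranks on a non-periodic grid; None means no neighbor."""
    def neighbor(step):
        moved = list(coords)
        moved[direction] += step
        if not 0 <= moved[direction] < dims[direction]:
            return None  # off the grid: MPI would report MPI.PROC_NULL here
        return cart_rank(moved, dims)
    return neighbor(-disp), neighbor(disp)

if __name__ == "__main__":
    dims = [3, 3, 3]
    coords_of_12 = [1, 1, 0]
    for direction in range(3):
        print("direction %d: (source, dest) = %s"
              % (direction, cart_shift(coords_of_12, dims, direction, 1)))
```

For a periodic axis the out-of-range case would wrap around instead of returning None; that variant is left out to keep the sketch short.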

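Direct communication between neighboring processes

A process can then call directly:

comm.send(data, dest=neighbor)
data = comm.recv(source=neighbor)

to send and receive data between adjacent processes, which makes communication more efficient. Note that the source rank must be passed by keyword, because the first positional parameter of recv() is an optional receive buffer.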


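Conclusion

By the end of this chapter you have the basic tools for parallel programming. We focused on the concepts behind the parent-child process mechanism of the multiprocessing module and explained them through a series of examples. You can now also extend the capabilities of the map() function to parallel systems and so significantly improve the execution efficiency of your programs.

We also introduced the mpi4py library, based on the MPI standard, which lets you launch a group of processes in parallel and exchange data among them using the message passing paradigm.

The following chapters explore more advanced parallel programming concepts and introduce a series of libraries and tools that extend these capabilities further.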
