Python from Beginner to Mastery, Chapter 11: Performance Optimization (Understanding Where Python Is Fast and Slow)

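Java/Kotlin developers are used to the JVM's JIT optimizations; Python's performance model is completely different. Understanding why code is slow matters more than optimizing blindly.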

11.1 The CPython Execution Model

Java/Kotlin comparison

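// Java: source → bytecode → JVM → JIT compilation → machine code
// HotSpot tiered compilation:
// 1. Interpretation (fast startup)
// 2. C1 (client compiler, quick and light optimization)
// 3. C2 (server compiler, deep optimization)
public class Perf {
    public static int add(int a, int b) {
        return a + b;  // once JIT-compiled and inlined, roughly a single ADD instruction
    }

    public static void main(String[] args) {
        // First calls: interpreted
        // Once the invocation counter crosses a threshold (a few hundred calls
        // with default tiered settings): C1 compilation
        // Hotter still (thousands of calls): C2 compilation (inlining, escape analysis, loop unrolling...)
        for (int i = 0; i < 1_000_000; i++) {
            add(i, i + 1);
        }
    }
}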

// Kotlin/JVM: same as Java, everything ends up in the HotSpot JIT
// Kotlin/Native: compiled ahead of time through LLVM IR to machine code (no JIT)
fun add(a: Int, b: Int): Int = a + b

fun main() {
    // On the JVM: the same JIT optimization path as Java
    repeat(1_000_000) { add(it, it + 1) }
}

Python implementation

import dis

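# === 1. Bytecode of a simple function ===
def add(a, b):
    return a + b

print("=== add(a, b) bytecode ===")
dis.dis(add)
# Output on CPython 3.11/3.12 (source-line column omitted; opcodes and offsets vary by version,
# and 3.10 or earlier shows BINARY_ADD instead of BINARY_OP):
#               0 RESUME                   0
#               2 LOAD_FAST                0 (a)
#               4 LOAD_FAST                1 (b)
#               6 BINARY_OP                0 (+)
#              10 RETURN_VALUE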

# === 2. 查看带循环的函数字节码 ===
def sum_loop(n):
    total = 0
    for i in range(n):
        total += i
    return total

print("\n=== sum_loop(n) bytecode ===")
dis.dis(sum_loop)
# Output (abridged and approximate; opcodes, offsets and line numbers vary by CPython version):
#   0           0 RESUME                   0
#
#   2           2 LOAD_CONST               1 (0)
#               4 STORE_FAST               1 (total)
#
#   4           6 LOAD_GLOBAL              1 (range)
#              14 LOAD_FAST                0 (n)
#              16 CALL                     1
#              26 GET_ITER
#         >>   28 FOR_ITER                14 (to 60)
#              32 STORE_FAST               2 (i)
#
#   5          34 LOAD_FAST                1 (total)
#              36 LOAD_FAST                2 (i)
#              38 BINARY_OP               13 (+=)
#              40 STORE_FAST               1 (total)
#              42 JUMP_BACKWARD           14 (to 28)
#
#   6     >>   60 LOAD_FAST                1 (total)
#              62 RETURN_VALUE

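# === 3. Bytecode cost of a function call ===
def direct_add(a, b):
    return a + b

def call_add(a, b):
    return direct_add(a, b)  # one extra layer of function call

print("\n=== direct_add bytecode ===")
dis.dis(direct_add)
print("\n=== call_add bytecode ===")
dis.dis(call_add)
# call_add adds LOAD_GLOBAL + LOAD_FAST + CALL instructions
# Each call has to set up a new frame, bind the arguments, run the body, and tear the frame down.
# (3.11+ uses lightweight internal frames and only materializes a full PyFrameObject
#  on demand, so calls are cheaper than in 3.10 but still far from free.)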

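# === 4. Measuring function call overhead ===
import timeit

setup = """
def add(a, b):
    return a + b
"""

direct = "a + b"
func_call = "add(a, b)"

# Plain addition
t_direct = timeit.timeit(direct, setup="a, b = 1, 2", number=10_000_000)
# Addition through a function call
# (setup already ends with a newline, so the assignment is appended as its own line;
#  prefixing it with ";" would start a line with ";" and raise SyntaxError)
t_func = timeit.timeit(func_call, setup=setup + "a, b = 1, 2", number=10_000_000)

print(f"Plain addition: {t_direct:.4f}s")
print(f"Function call:  {t_func:.4f}s")
print(f"Call overhead:  {t_func / t_direct:.1f}x slower")
# Illustrative output (numbers depend heavily on the machine and the Python version):
# Plain addition: 0.12s
# Function call:  0.85s
# Call overhead:  7.1x slower
# Reason: every call sets up and tears down a frame and passes its arguments through the interpreter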

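# === 5. Bytecode instruction types ===
import dis

def demo_instructions():
    x = 42                    # LOAD_CONST → STORE_FAST
    y = [1, 2, 3]            # BUILD_LIST → LOAD_CONST (1, 2, 3) → LIST_EXTEND → STORE_FAST (3.9+)
    z = x + 1                # LOAD_FAST → LOAD_CONST → BINARY_OP (BINARY_ADD before 3.11) → STORE_FAST
    result = func(z)         # LOAD_GLOBAL → LOAD_FAST → CALL → STORE_FAST
    return result             # LOAD_FAST → RETURN_VALUE

def func(v):
    return v * 2

print("=== demo_instructions bytecode ===")
dis.dis(demo_instructions)

# Key instruction types (names vary somewhat between CPython versions):
# LOAD_FAST     : load from the local variable array (fastest, index access)
# LOAD_GLOBAL   : look up in the globals dict, then builtins (slower, hash lookup)
# LOAD_CONST    : load from the constants tuple (fastest)
# STORE_FAST    : store into the local variable array
# STORE_NAME    : store into the module/class namespace dict (slower)
# BINARY_ADD    : binary addition in 3.10 and earlier (dispatches to __add__)
# BINARY_OP     : generic binary operation in 3.11+, replaces BINARY_ADD and its siblings
# CALL          : function call (3.11+, replaces CALL_FUNCTION)
# COMPARE_OP    : comparison
# GET_ITER      : get an iterator
# FOR_ITER      : advance the iterator
# JUMP_BACKWARD : backward jump for loops (3.11+)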

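Key differences

Dimension	JVM (Java/Kotlin)	CPython
Bytecode execution	Interpretation + JIT compilation to machine code	Pure interpretation (apart from the experimental JIT in 3.13+)
Hot-spot optimization	Yes (method/loop counters → C1 → C2)	No compilation; 3.11+ has a lightweight specializing adaptive interpreter
Inlining	Supported (aggressive in C2)	Not supported
Escape analysis	Supported (scalar replacement, lock elision)	Not supported
Startup	Slower (JVM startup and warm-up)	Fast
Steady-state performance	Fast (after JIT optimization)	Slow (interpreted)
Memory model	Heap (tracing GC) + stacks + JIT code cache	Heap (reference counting plus a cycle collector) + stack
Bytecode format	Stack-based (operand stack + local variable slots)	Stack-based (value stack + fast-locals array)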

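Common pitfalls

# Pitfall 1: trying to "warm up" Python to make it fast
def compute():
    total = 0
    for i in range(1_000_000):
        total += i
    return total

# Warming up 1000 times gains nothing in Python!
for _ in range(1000):
    compute()
# CPython has no optimizing JIT by default. The 3.11+ interpreter specializes hot
# bytecode after only a handful of executions, so a long warm-up adds nothing.
# Better: use a C extension, Cython, or a different algorithm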

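# Pitfall 2: ignoring attribute access cost inside a hot loop
import timeit

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

def sum_sq(p, n):
    # p.x and p.y are looked up again on every iteration:
    # LOAD_FAST(p) → LOAD_ATTR(x)
    total = 0.0
    for _ in range(n):
        total += p.x * p.x + p.y * p.y
    return total

# Optimization: hoist the attributes into locals before the loop.
# (Hoisting only pays off when an attribute is read repeatedly; reading it once gains nothing.)
def sum_sq_fast(p, n):
    x, y = p.x, p.y  # two attribute lookups in total, then plain LOAD_FAST
    total = 0.0
    for _ in range(n):
        total += x * x + y * y
    return total

t1 = timeit.timeit('sum_sq(p, 1000)', setup='from __main__ import sum_sq, Point; p=Point(3,4)', number=1_000)
t2 = timeit.timeit('sum_sq_fast(p, 1000)', setup='from __main__ import sum_sq_fast, Point; p=Point(3,4)', number=1_000)
print(f"Attribute access: {t1:.4f}s")
print(f"Local variables:  {t2:.4f}s")
print(f"Speedup:          {t1/t2:.2f}x")
# Local variable access (LOAD_FAST) is cheaper than attribute access (LOAD_ATTR);
# the gap is smaller on 3.11+, where LOAD_ATTR is specialized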

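When to use

Understanding the CPython model is the foundation of all performance work
Use the dis module to inspect the bytecode cost of hot code
Function calls are expensive: avoid calling small functions repeatedly in hot loops
Local variable access is faster than global/attribute access: hoist values into locals in hot loops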
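
The advice about using dis can be followed without depending on one CPython version's exact listing. The following is a minimal sketch that assumes only the standard `dis` module; `opnames` is a hypothetical helper name, not part of the original text. It collects opcode names so that two functions can be compared programmatically.

```python
import dis

def opnames(func):
    """Return the opcode names of func in bytecode order."""
    return [ins.opname for ins in dis.get_instructions(func)]

def add(a, b):
    return a + b

def call_add(a, b):
    return add(a, b)  # extra global lookup plus a call

names_add = opnames(add)
names_call = opnames(call_add)

print("add:     ", names_add)
print("call_add:", names_call)

# The addition opcode is BINARY_ADD up to 3.10 and BINARY_OP from 3.11 on.
print("addition opcode:", [n for n in names_add if n.startswith("BINARY_")])
# The call shows up as CALL_FUNCTION (3.10 and earlier) or CALL (3.11+).
print("call opcodes:   ", [n for n in names_call if n.startswith("CALL")])
```

Comparing the two lists makes the extra work in `call_add` visible (a `LOAD_GLOBAL` and a call instruction) whichever opcode names the running interpreter uses.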

11.2 The 3.11+ Specializing Adaptive Interpreter (PEP 659)

Java/Kotlin comparison

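// JVM JIT: compiles hot code to native machine code
// Collects type feedback at run time and optimizes aggressively:
// - method inlining
// - escape analysis
// - loop unrolling
// - profile-guided branch layout
// - lock elision
// Cost: compilation CPU time and warm-up, more memory (the code cache can reach hundreds of MB)

// HotSpot ships two JIT compilers, with Graal as an alternative:
// C1 (client compiler): fast compilation, light optimization
// C2 (server compiler): slow compilation, deep optimization
// Graal (newer): a compiler written in Java, extensible

// Kotlin/JVM: relies entirely on the HotSpot JIT
// Kotlin/Native: AOT compiled, no run-time optimization
// Kotlin/Wasm: compiled to WebAssembly, JIT-compiled by the browser

Python implementation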

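# === 1. How the adaptive interpreter works ===
# 3.10: every BINARY_ADD has to:
#   1. Check the operand types
#   2. Look up the __add__ implementation
#   3. Call it
#   4. Check the result
#   5. Handle error cases

# 3.11+: starts with the generic BINARY_OP and observes the operand types:
#   If the operands keep being int + int:
#   → the instruction is rewritten in place to BINARY_OP_ADD_INT (specialized)
#   → a cheap type guard replaces the generic dispatch and __add__ lookup
#   → the C-level integer addition runs directly

# Analogy: similar to inline caching in the JVM,
# but nothing is compiled to machine code; a faster bytecode instruction is swapped in

def add_numbers(a, b):
    return a + b

# First calls: generic BINARY_OP (adaptive, still counting)
# After warm-up (int + int): specialized to BINARY_OP_ADD_INT
# If a str suddenly shows up: the guard fails and execution falls back to the generic path (deoptimization)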

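# === 2. Benchmark: uniform types vs mixed types ===
import timeit

# Type-stable: always int + int
def stable_add(n):
    total = 0
    for i in range(n):
        total += i  # int + int, stable
    return total

# Mixed types: total becomes a float at i == 1, so the int fast path no longer applies
def unstable_add(n):
    total = 0
    for i in range(n):
        if i % 2 == 0:
            total += i       # int + int once, then float + int
        else:
            total += float(i) # int + float once, then float + float
    return total

N = 100_000
t_stable = timeit.timeit(lambda: stable_add(N), number=100)
t_unstable = timeit.timeit(lambda: unstable_add(N), number=100)

print(f"Type-stable: {t_stable:.4f}s")
print(f"Mixed types: {t_unstable:.4f}s")
print(f"Stable is {t_unstable/t_stable:.2f}x faster")
# The adaptive interpreter helps type-stable code the most.
# Caveat: unstable_add also pays for the extra modulo, branch and float() call,
# so this ratio overstates the cost of missed specialization alone.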

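# === 3. Measuring the 3.11 speedup: a hot loop ===
import timeit

def hot_loop():
    """Pure-Python arithmetic loops like this gain the most from the 3.11 adaptive interpreter"""
    result = 0
    for i in range(1000):
        for j in range(100):
            result += i * j
    return result

# Illustrative timings from one machine (yours will differ):
# Python 3.10: ~0.45s
# Python 3.11: ~0.28s
# Python 3.12: ~0.24s
# Officially, 3.11 is 10-60% faster than 3.10 (about 1.25x on average), depending on the workload

t = timeit.timeit(hot_loop, number=100)
print(f"Hot loop, 100 runs: {t:.4f}s")
print(f"Average per run:    {t/100:.4f}s")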

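# === 4. Which code benefits most ===
# 1. Type-stable operations in hot loops (int + int, list.append, ...)
# 2. Frequently called pure-Python functions
# 3. Attribute access (LOAD_ATTR is adaptive)
# 4. Global variable lookup (LOAD_GLOBAL is adaptive)

# Which code benefits little:
# 1. I/O (the bottleneck is in system calls)
# 2. Calls into C extensions (already machine code)
# 3. Code whose types change frequently
# 4. Code that runs only once

# === 5. Inline caching, schematically ===
# How inline caching works in 3.11 (simplified; the numbers are illustrative):

# First executions of x.name (warm-up):
#   LOAD_ATTR 0 (name)  → generic version, looks up x.__dict__['name']
#   Records: the type of x is Foo, name lives at slot 3 of its attribute storage

# Later execution of x.name (x is still a Foo):
#   LOAD_ATTR 0 (name)  → specialized version
#   Is the type of x still Foo? Yes → read slot 3 directly
#   The dict hash lookup is skipped!

# Later execution of x.name (x is now a Bar):
#   LOAD_ATTR 0 (name)  → specialized version
#   Is the type of x still Foo? No → deoptimize, fall back to the generic version
#   The cache may later be re-specialized for Bar

class Foo:
    def __init__(self):
        self.name = "foo"
        self.value = 42

f = Foo()
# First f.name accesses: generic LOAD_ATTR
# Later f.name accesses: specialized LOAD_ATTR (direct slot access)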

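Key differences

Dimension	JVM JIT	Python 3.11+ adaptive interpreter
Optimization approach	Compiles to native machine code	Specializes bytecode instructions
Inlining	Supported (deep inlining)	Not supported
Escape analysis	Supported (scalar replacement)	Not supported
Loop unrolling	Supported	Not supported
Memory overhead	Larger (code cache can reach hundreds of MB)	Tiny (small inline caches stored next to the bytecode)
Speedup	Often 10x or more (vs pure interpretation)	About 1.1-1.6x, 1.25x on average (vs 3.10)
Deoptimization	Supported (falls back to the interpreter)	Supported (falls back to the generic instruction)
Compilation cost	Background compiler threads consume CPU; peak speed arrives only after warm-up	None (done inside the interpreter)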

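Common pitfalls

# Pitfall: expecting the adaptive interpreter to make up for a poor algorithm
# 3.11 will not turn O(n²) into O(n)

def slow_search(lst, target):
    """O(n) search: slow even with adaptive optimization"""
    for item in lst:
        if item == target:
            return True
    return False

def fast_search(s, target):
    """O(1) average hash lookup (s is a set): fast on 3.10 and 3.11 alike"""
    return target in s

# The adaptive interpreter makes slow code roughly 10-60% faster
# A better algorithm can make it 100-1000x faster
# Pick the right algorithm first, then think about interpreter optimizations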

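When to use

Upgrading to Python 3.11+ gives you the speedup for free
It helps pure-Python code; it does nothing for time spent inside C extensions
Keep types stable to get the most out of it
Do not rely on the adaptive interpreter to make up for a poor algorithm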
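
Specialization can be observed directly. The sketch below assumes CPython 3.11 or later for the `adaptive=True` argument of `dis.get_instructions` and falls back to the plain listing on older versions; `binary_opnames` is a hypothetical helper name introduced here for illustration.

```python
import dis
import sys

def add_numbers(a, b):
    return a + b

# Warm up with stable int + int operands so the interpreter has a chance to specialize.
for i in range(1000):
    add_numbers(i, i + 1)

def binary_opnames(func):
    """Names of the binary-operation opcodes in func, specialized names if available."""
    if sys.version_info >= (3, 11):
        # adaptive=True (3.11+) shows instructions as they are currently specialized
        instructions = dis.get_instructions(func, adaptive=True)
    else:
        instructions = dis.get_instructions(func)
    return [ins.opname for ins in instructions if ins.opname.startswith("BINARY_")]

print(binary_opnames(add_numbers))
```

On a specializing interpreter the printed name is usually a specialized variant of `BINARY_OP` after the warm-up loop; on 3.10 it is plain `BINARY_ADD`. The exact name is an implementation detail and may change between releases.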

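11.3 The 3.13+ Experimental JIT Compiler (PEP 744)

Java/Kotlin comparison

// JVM JIT: a mature, production-grade JIT
// - more than two decades of continuous optimization (HotSpot dates from 1999)
// - tiered compilation (C1/C2, optionally Graal)
// - deep optimizations: inlining, escape analysis, loop unrolling, profile-guided branch layout
// - code cache management
// - a mature deoptimization mechanism

// Kotlin/Native: AOT compilation
// - types are resolved at compile time
// - no run-time profiling feedback for the optimizer
// - no JIT, but thorough compile-time optimization

Python implementation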

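# === 1. The copy-and-patch JIT technique ===
# Traditional JIT (e.g. the JVM):
#   bytecode → IR → optimization → register allocation → machine code generation
#   the code generation backend is the most complex part

# copy-and-patch:
#   When CPython itself is built, LLVM compiles each micro-op into a
#   machine-code template (a "stencil") that contains "holes"
#   At run time the JIT copies the stencils and patches concrete values into the holes
#   This sidesteps a full code generation backend at run time
#   Compilation is very fast, which suits use inside an interpreter

# Analogy: a JVM JIT is like tailoring a suit from scratch;
#          copy-and-patch is like buying off the rack and only altering the sleeves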

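# === 2. How to enable ===
# Prerequisite: the interpreter must be built with JIT support, e.g.
# $ ./configure --enable-experimental-jit          # JIT built in and on by default
# $ ./configure --enable-experimental-jit=yes-off  # JIT built in, off by default
# Many prebuilt 3.13 binaries are compiled without it.

# On a JIT-capable build, toggle it with an environment variable:
# $ PYTHON_JIT=1 python3.13 script.py
# $ PYTHON_JIT=0 python3.13 script.py   # force it off

# Python 3.13 has no -X option for the JIT, and it cannot be switched on from
# inside a running program; the choice is made at interpreter startup.
import sys
if sys.version_info >= (3, 13):
    # The JIT is experimental in 3.13 and off by default in standard builds
    pass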

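# === 3. Current status (Python 3.13) ===
# - Experimental, off by default
# - Limited gains so far (a few percent at best; some workloads are not faster at all)
# - Mainly targets simple operations in hot loops
# - Future direction: more aggressive optimizations and larger speedups

# === 4. Sketch: how the JIT affects a hot loop ===
def compute():
    total = 0
    for i in range(1_000_000):
        total += i * i
    return total

# Without the JIT: every bytecode is interpreted
# With the JIT (3.13): a hot loop may be compiled to machine code,
#   but the implementation is still early and the gain is small

# Note: the 3.13 JIT is experimental and not recommended for production
# PEP 703 (free-threaded / no-GIL builds) is also experimental in 3.13
# At the time of 3.13 the JIT did not support free-threaded builds,
# so check the release notes before trying to combine the two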

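Key differences

Dimension	JVM JIT	Python 3.13 JIT
Maturity	Production grade (over two decades)	Experimental
On by default	Yes	No
Compilation strategy	Tiered (C1/C2)	copy-and-patch
Optimization depth	Very deep (inlining, escape analysis, ...)	Shallow (simple operations)
Code cache	Up to hundreds of MB	Very small
Speedup	Often 10x or more (vs interpretation)	A few percent at best (currently)
Deoptimization	Mature	Basic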

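When to use

Keep an eye on it but do not depend on it; wait until it stabilizes before using it in production
You can try it in a test environment on a JIT-capable build: PYTHON_JIT=1
In the long run, the Python JIT is key to narrowing the performance gap with the JVM
For now, the 3.11+ adaptive interpreter already delivers the more significant gains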
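
Before experimenting, it helps to check whether the running interpreter can use the JIT at all. The sketch below is an assumption-laden probe, not an official API: `jit_status` is a hypothetical helper, `CONFIG_ARGS` is typically only populated on POSIX builds, and `sys._jit` is assumed to be the private introspection module added in 3.14 (the code guards against its absence).

```python
import os
import sys
import sysconfig

def jit_status():
    """Collect hints about JIT support in the running interpreter."""
    config_args = sysconfig.get_config_var("CONFIG_ARGS") or ""
    status = {
        "version": sys.version_info[:3],
        # Requested at startup via the environment (only honored by JIT-capable builds)
        "PYTHON_JIT": os.environ.get("PYTHON_JIT"),
        # Build flag; CONFIG_ARGS is usually empty on Windows builds
        "built_with_jit_flag": "--enable-experimental-jit" in config_args,
    }
    # Assumption: sys._jit is a private module that exists on 3.14+ only.
    jit = getattr(sys, "_jit", None)
    if jit is not None and hasattr(jit, "is_enabled"):
        status["jit_enabled"] = bool(jit.is_enabled())
    return status

print(jit_status())
```

A `False` build flag together with a missing `jit_enabled` key suggests that setting `PYTHON_JIT=1` will have no effect on this interpreter.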

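11.4 The Profiling Toolchain

Java/Kotlin comparison

// Java profiling tools:
// 1. VisualVM: GUI, method-level CPU and memory analysis
// 2. JFR (Java Flight Recorder): low-overhead continuous recording
// 3. async-profiler: sampling profiler, flame graphs
// 4. JMH: microbenchmark harness (similar to Python's timeit)
// 5. JDK Mission Control: analyzes JFR recordings

// Characteristics: well integrated. JFR ships with the JDK; VisualVM (since JDK 9),
// async-profiler and JMH are separate downloads or dependencies.
// async-profiler can attach to a running process without code changes

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;
// JMH microbenchmark
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class MyBenchmark {
    @Benchmark
    public void testMethod() {
        // code under test
    }
}

// Kotlin/JVM: uses the same Java profiling tools
// Kotlin/Native: use Instruments (macOS) or perf (Linux)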

Python 实现

4.1 timeit: 微基准测试

import timeit

# === 基本用法 ===
# 测量代码片段执行时间
t = timeit.timeit('sum(range(1000))', number=10000)
print(f"sum(range(1000)) x10000: {t:.4f}s")

# === repeat: 多次运行取统计 ===
# repeat=5: 运行 5 轮，每轮 number=10000 次
times = timeit.repeat('sum(range(1000))', number=10000, repeat=5)
print(f"5 轮耗时: {[f'{t:.4f}' for t in times]}")
print(f"最小值: {min(times):.4f}s (最可信)")
print(f"最大值: {max(times):.4f}s")
print(f"平均值: {sum(times)/len(times):.4f}s")

# === 测量函数 ===
def my_func():
    return [i * 2 for i in range(1000)]

t = timeit.timeit(my_func, number=10000)
print(f"my_func x10000: {t:.4f}s")

# === setup: 初始化代码 ===
# setup 只执行一次，不计入测量时间
t = timeit.timeit(
    'data.sort()',
    setup='import random; data = [random.random() for _ in range(10000)]',
    number=1000
)
print(f"排序 10000 个元素 x1000: {t:.4f}s")

# === 命令行用法 ===
# $ python -m timeit -n 10000 -r 5 "sum(range(1000))"
# 10000 loops, best of 5: 25.3 usec per loop

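# === timeit vs manual timing ===
import time

# Wrong: time.time()
start = time.time()
sum(range(1000))
elapsed = time.time() - start
# time.time() has lower resolution and is affected by system clock adjustments

# Right: time.perf_counter()
start = time.perf_counter()
sum(range(1000))
elapsed = time.perf_counter() - start
# perf_counter() uses the highest-resolution clock available and is unaffected by clock adjustments

# Right: timeit (runs the statement `number` times and returns the total;
# use timeit.repeat and take the minimum for the most reliable figure)
t = timeit.timeit('sum(range(1000))', number=100000)
print(f"timeit: {t/100000*1e6:.2f} usec per call")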
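
`timeit.timeit` returns the total time for `number` executions, so a per-call figure needs a division, and a stable figure needs several rounds. The sketch below wraps both steps; `best_per_call` is a hypothetical helper name, and it relies only on `timeit.Timer.autorange` and `timeit.Timer.repeat` from the standard library.

```python
import timeit

def best_per_call(stmt, setup="pass", repeat=3):
    """Return the best observed time per call, in seconds."""
    timer = timeit.Timer(stmt, setup=setup)
    # autorange() picks a loop count so that one round takes at least 0.2 s
    number, _ = timer.autorange()
    rounds = timer.repeat(repeat=repeat, number=number)
    # The minimum is the round least disturbed by other activity on the machine
    return min(rounds) / number

t = best_per_call("sum(range(1000))")
print(f"sum(range(1000)): {t * 1e6:.2f} usec per call")
```

Taking the minimum rather than the mean follows the same reasoning as the "best of N" line printed by `python -m timeit`.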

4.2 cProfile: function-level profiling

import cProfile
import pstats
import io
import random  # module level, because both load_data() and main() use it

# === Sample code: a program with a performance problem ===
def slow_search(data, target):
    """O(n) linear search"""
    for i, item in enumerate(data):
        if item == target:
            return i
    return -1

def fast_search(data_set, target):
    """O(1) average hash lookup; build data_set = set(data) once, outside the hot path"""
    return target in data_set

def process_data(data, targets):
    """Process data: run a search for every target"""
    results = []
    for target in targets:
        # deliberately uses the slow search to demonstrate profiling
        idx = slow_search(data, target)
        results.append(idx)
    return results

def load_data():
    """Simulate loading data"""
    return [random.randint(0, 100_000) for _ in range(10_000)]

def main():
    data = load_data()
    targets = [random.randint(0, 100_000) for _ in range(1000)]
    for _ in range(10):
        process_data(data, targets)
# === Option 1: command line ===
# $ python -m cProfile -s cumulative script.py
# $ python -m cProfile -s tottime script.py
# -s cumulative: sort by cumulative time (including callees)
# -s tottime: sort by the function's own time (excluding callees)

# === Option 2: from code ===
profiler = cProfile.Profile()
profiler.enable()

main()  # the code being profiled

profiler.disable()

# Print to the console
profiler.print_stats(sort='cumulative')

# === Option 3: fine-grained control over the output ===
s = io.StringIO()
ps = pstats.Stats(profiler, stream=s).sort_stats('tottime')
ps.print_stats(10)  # print only the top 10 functions
print(s.getvalue())

# === Option 4: profile one specific function ===
profiler = cProfile.Profile()
profiler.enable()
process_data(list(range(10000)), list(range(1000)))
profiler.disable()

ps = pstats.Stats(profiler)
ps.strip_dirs()  # drop directory prefixes
ps.sort_stats('tottime')
ps.print_stats(5)  # the 5 most expensive functions

# Sample output for main() (illustrative numbers):
#    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
#     10000    0.350    0.000    0.350    0.000 demo.py:3(slow_search)
#        10    0.002    0.000    0.352    0.035 demo.py:12(process_data)
#         1    0.001    0.001    0.353    0.353 demo.py:20(main)

# How to read it:
# ncalls: number of calls
# tottime: time spent in the function itself (excluding callees); look here for bottlenecks
# cumtime: cumulative time (including callees); look here for hot paths
# percall: average time per call (tottime or cumtime divided by ncalls)
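4.3 line_profiler: line-level profiling

# Install: pip install line_profiler

# line_profiler measures the execution time of every line
# It is finer-grained than cProfile and shows which line inside a function is slowest

# Usage 1: command line
# $ kernprof -l -v script.py
# -l: line-by-line profiling
# -v: show the results immediately

# Usage 2: in code
# Note: under kernprof, the @profile decorator is injected for you and needs no import.
# Only decorated functions are profiled, so uncomment the decorator below when
# running under kernprof (left commented so the file still runs with plain python).

# @profile
def example_for_line_profiler():
    """
    Run this file with kernprof -l -v to see the line-level results
    """
    import random
    import math

    data = [random.random() for _ in range(10000)]  # line 1

    total = 0.0                                       # line 2
    for x in data:                                    # line 3
        total += math.sqrt(x)                         # line 4
        total += math.log(x + 0.001)                  # line 5

    result = total / len(data)                        # line 6
    return result                                     # line 7

# Sample kernprof output (illustrative numbers):
# Timer unit: 1e-06 s
#
# Total time: 0.015 s
# File: demo.py
# Function: example at line 1
#
# Line #      Hits         Time  Per Hit   % Time  Line Contents
# ==============================================================
#      1                                           def example():
#      2         1         12.0     12.0      0.1      data = [...]
#      3         1          1.0      1.0      0.0      total = 0.0
#      4     10000       3500.0      0.4     23.3      for x in data:
#      5     10000       7500.0      0.8     50.0          total += math.sqrt(x)
#      6     10000       4000.0      0.4     26.7          total += math.log(x+0.001)
#      7         1          2.0      2.0      0.0      result = total/len(data)
#
# Reading: the math.sqrt line takes 50% of the time, the math.log line about 27%,
# and the loop itself about 23%
# Direction: replace the per-element calls with numpy vectorization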

# === Using line_profiler from code ===
# If you would rather not use the kernprof command line, drive it directly:
from line_profiler import LineProfiler

def target_function():
    import math
    total = 0
    for i in range(100000):
        total += math.sin(i) * math.cos(i)
    return total

lp = LineProfiler()
lp_wrapper = lp(target_function)
lp_wrapper()

lp.print_stats()

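4.4 py-spy: sampling profiler

# Install: pip install py-spy
# Written in Rust, needs no code changes, can attach to a running process
# (attaching to another process usually requires elevated privileges, e.g. sudo)

# === 1. Live view of the top functions ===
# $ py-spy top --pid <PID>
# Like the top command: shows the functions using the most CPU in real time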

# === 2. 生成火焰图 ===
# $ py-spy record -o flamegraph.svg --pid <PID>
# $ py-spy record -o flamegraph.svg -- python script.py

# === 3. dump 当前调用栈 ===
# $ py-spy dump --pid <PID>
# 显示所有线程的当前调用栈

# === 4. 分析正在运行的程序 ===
# $ py-spy top --pid 12345
# 输出示例:
# Process ID: 12345
# Collecting samples...
#
# % own   % total  Function
#  45.2%   45.2%   slow_search (myapp.py:15)
#  20.1%   65.3%   process_data (myapp.py:25)
#  15.3%   80.6%   main (myapp.py:35)
#  10.0%   10.0%   load_data (myapp.py:10)
#   9.4%    9.4%   {built-in method builtins.sum}

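# === vs Java async-profiler ===
# async-profiler: samples JVM stack frames, covers Java + native (JNI) code
# py-spy: samples Python stack frames; pass --native to include frames from C extensions
# Both are low-overhead sampling profilers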

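4.5 memray: memory profiling

# Install: pip install memray  (Linux and macOS)
# Memory profiler for Python 3.8+ (open-sourced by Bloomberg)

# === 1. Command line: track allocations ===
# $ memray run my_script.py
# $ memray flamegraph memray-my_script.12345.bin

# === 2. Using it from code ===
# memray is normally driven from the command line. It also offers a Python API:
# wrap the code of interest in `with memray.Tracker("output.bin"):` to record just that block.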

def memory_heavy_function():
    """演示内存分配"""
    data = []
    for i in range(100000):
        data.append({'id': i, 'value': i * 2, 'name': f'item_{i}'})
    return data

# 命令行运行:
# $ memray run -o output.bin my_script.py
# $ memray stats output.bin
# $ memray flamegraph output.bin
# $ memray tree output.bin

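# === 3. Live mode (useful when hunting leaks) ===
# $ memray run --live my_script.py
# Shows allocations in real time while the script runs

# === vs Java VisualVM memory analysis ===
# Java: VisualVM shows heap snapshots, object distribution and GC activity
# Python: memray covers similar ground
#         tracemalloc offers lighter-weight memory tracking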

4.6 tracemalloc: lightweight memory tracking

import tracemalloc

# === 1. 基本用法 ===
tracemalloc.start()

# ... 运行代码 ...
data = [list(range(1000)) for _ in range(1000)]

current, peak = tracemalloc.get_traced_memory()
print(f"当前内存: {current / 1024 / 1024:.2f} MB")
print(f"峰值内存: {peak / 1024 / 1024:.2f} MB")

tracemalloc.stop()

# === 2. 查找内存泄漏 ===
tracemalloc.start()

# 快照 1
snapshot1 = tracemalloc.take_snapshot()

# ... 可能泄漏的代码 ...
leaky = []
for i in range(100000):
    leaky.append([i] * 100)

# 快照 2
snapshot2 = tracemalloc.take_snapshot()

# 比较差异
top_stats = snapshot2.compare_to(snapshot1, 'lineno')

print("=== 内存增长 Top 10 ===")
for stat in top_stats[:10]:
    print(stat)

# === 3. 查看具体哪行分配了最多内存 ===
tracemalloc.start()
data = [list(range(1000)) for _ in range(1000)]

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')

print("\n=== 内存分配 Top 5 ===")
for stat in top_stats[:5]:
    print(stat)
    # 输出: demo.py:42: size=8.0 MB, count=1000, average=8.2 KB
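
The start/measure/stop pattern above is easy to get wrong when a block raises. The sketch below packages it as a context manager; `traced_peak` is a hypothetical helper name, and it assumes tracemalloc is not already tracing (otherwise the final `stop()` would also end the outer trace).

```python
import tracemalloc
from contextlib import contextmanager

@contextmanager
def traced_peak(result):
    """Store current and peak traced memory (bytes) of the with-block in result."""
    tracemalloc.start()
    try:
        yield result
    finally:
        # Read the counters before stopping; stop() discards them
        result["current"], result["peak"] = tracemalloc.get_traced_memory()
        tracemalloc.stop()

stats = {}
with traced_peak(stats):
    data = [list(range(1000)) for _ in range(100)]

print(f"current: {stats['current'] / 1024:.1f} KiB, peak: {stats['peak'] / 1024:.1f} KiB")
```

Because `data` is still alive when the block ends, `current` stays above zero; `peak` can only be equal or higher.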

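Profiling decision tree

Need to analyze performance?
├── Microbenchmark (compare two implementations)
│   └── timeit (runs the statement many times; use repeat and take the minimum)
│       $ python -m timeit -n 10000 -r 5 "code_here"
│
├── Find function-level bottlenecks
│   └── cProfile
│       $ python -m cProfile -s tottime script.py
│       or in code: cProfile.Profile()
│
├── Find the slow line inside a function
│   └── line_profiler
│       $ pip install line_profiler
│       $ kernprof -l -v script.py
│
├── Non-intrusive analysis in production
│   └── py-spy
│       $ py-spy top --pid <PID>
│       $ py-spy record -o flame.svg -- python script.py
│
├── Memory analysis
│   ├── Lightweight tracking → tracemalloc (standard library)
│   └── Deep analysis → memray (third party)
│       $ memray run -o output.bin script.py
│
└── Continuous monitoring
    ├── py-spy top (live sampling)
    └── memray run --live (live allocation view)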

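Key differences

Tool	Purpose	Closest Java counterpart	Intrusiveness
timeit	Microbenchmarks	JMH	None
cProfile	Function-level profiling	VisualVM CPU profiler	Medium (instruments every call, roughly 2-5x slowdown)
line_profiler	Line-level profiling	JFR events (loose analogy)	High (needs a decorator, traces every line)
py-spy	Sampling profiling	async-profiler	None (attaches from outside)
memray	Memory profiling	VisualVM memory profiler	Low
tracemalloc	Memory tracking	-	Low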

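Common pitfalls

# Pitfall 1: microbenchmarking with time.time()
import time
start = time.time()  # low resolution, affected by system clock adjustments
# ... code ...
elapsed = time.time() - start

# Right: time.perf_counter() or timeit
start = time.perf_counter()
# ... code ...
elapsed = time.perf_counter() - start

# Pitfall 2: optimizing without profiling
# Profile first to find the real bottleneck, then optimize!
# "Premature optimization is the root of all evil" (Donald Knuth)

# Pitfall 3: ignoring the GC in microbenchmarks
# timeit disables the GC by default; if the test creates many objects,
# enable it explicitly: timeit.timeit(..., setup='gc.enable()')

# Pitfall 4: cProfile has its own overhead (roughly a 2-5x slowdown)
# Do not quote cProfile's absolute times in performance reports
# Use it only to find relative bottlenecks (which function is slowest)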

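When to use

timeit: compare the performance of two implementations
cProfile: find a program's hot functions
line_profiler: find the slowest line inside a function
py-spy: non-intrusive analysis in production
memray/tracemalloc: investigating memory problems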
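
For the cProfile case, the enable/disable/print sequence can be wrapped once and reused. This is a minimal sketch built only on `cProfile`, `pstats` and `io`; `profile_top` is a hypothetical helper name, and the two small functions simply mirror the chapter's search example.

```python
import cProfile
import io
import pstats

def profile_top(func, *args, limit=5, sort="tottime", **kwargs):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.strip_dirs().sort_stats(sort).print_stats(limit)
    return result, buffer.getvalue()

def slow_search(data, target):
    for i, item in enumerate(data):
        if item == target:
            return i
    return -1

def process_data(data, targets):
    return [slow_search(data, t) for t in targets]

result, report = profile_top(process_data, list(range(2000)), list(range(0, 2000, 20)))
print(report)
```

Returning the report as a string keeps the helper usable in logs or tests; sorting by `tottime` puts the function that burns the most time of its own at the top.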

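11.5 Time Complexity and Benchmarks

Java/Kotlin comparison

// Time complexity in Java is essentially the same as in Python,
// because the underlying data structures are the same:
// ArrayList ≈ Python list (dynamic array)
// HashMap  ≈ Python dict (hash table)
// HashSet  ≈ Python set  (hash table)
// LinkedList ≈ no direct counterpart (Python uses deque)

import java.util.*;

// Java developers need to watch for the same things:
// - ArrayList.contains() is O(n), not O(1)
// - ArrayList.add(0, x) is O(n), not O(1)
// - HashMap.get() is O(1) on average

Python implementation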

5.1 Time complexity table

# === Time complexity of dict/list/set operations ===
#
# | Operation              | Complexity     | Notes                          |
# |------------------------|----------------|--------------------------------|
# | list.append(x)         | O(1) amortized | Same as Java ArrayList.add     |
# | list.insert(0, x)      | O(n)           | Shifts every element right     |
# | list[i]                | O(1)           | Index access                   |
# | list.pop()             | O(1)           | Pops the last element          |
# | list.pop(0)            | O(n)           | Shifts every element left      |
# | list.remove(x)         | O(n)           | Linear search + shift          |
# | x in list              | O(n)           | Linear search                  |
# | list.sort()            | O(n log n)     | Timsort                        |
# | list.copy()            | O(n)           | Shallow copy                   |
# |                        |                |                                |
# | dict[key]              | O(1) average   | Hash lookup                    |
# | dict.get(key)          | O(1) average   | Hash lookup                    |
# | dict[key] = val        | O(1) average   | Hash insert (amortized resize) |
# | del dict[key]          | O(1) average   | Hash delete                    |
# | key in dict            | O(1) average   | Hash lookup                    |
# | iterating a dict       | O(n)           |                                |
# |                        |                |                                |
# | set.add(x)             | O(1) average   | Same as Java HashSet.add       |
# | x in set               | O(1) average   | Hash lookup                    |
# | set.remove(x)          | O(1) average   |                                |
# | set & set              | O(min(n,m))    | Intersection                   |
# | set | set              | O(n+m)         | Union                          |

5.2 实际 benchmark: O(1) vs O(n)

import timeit

# === 1. set vs list 成员检查 ===
N = 10000

setup_list = f'data = list(range({N}))'
setup_set = f'data = set(range({N}))'

t_list = timeit.timeit('9999 in data', setup=setup_list, number=10000)
t_set = timeit.timeit('9999 in data', setup=setup_set, number=10000)

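print(f"=== set vs list membership test (n={N}) ===")
print(f"list:  {t_list:.4f}s")
print(f"set:   {t_set:.4f}s")
print(f"set is {t_list/t_set:.0f}x faster")
# Typical result: the set is hundreds to thousands of times faster
# (9999 is the last element, which is the worst case for the list scan)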

# === 2. list.append vs list.insert(0) ===
N = 10000

# The statement needs a defined value to insert, so setup creates i as well as the list
t_append = timeit.timeit(
    'lst.append(i)',
    setup='lst = []; i = 0',
    number=N
)

t_insert = timeit.timeit(
    'lst.insert(0, i)',
    setup='lst = []; i = 0',
    number=N
)

print(f"\n=== list.append vs list.insert(0) (n={N}) ===")
print(f"append:      {t_append:.4f}s  (O(1))")
print(f"insert(0):   {t_insert:.4f}s  (O(n))")
print(f"insert(0) is {t_insert/t_append:.0f}x slower")

# === 3. deque.appendleft vs list.insert(0) ===
from collections import deque

t_deque = timeit.timeit(
    'dq.appendleft(i)',
    setup='from collections import deque; dq = deque(); i = 0',
    number=N
)

print(f"\n=== deque.appendleft vs list.insert(0) (n={N}) ===")
print(f"deque.appendleft: {t_deque:.4f}s  (O(1))")
print(f"list.insert(0):   {t_insert:.4f}s  (O(n))")
print(f"deque is {t_insert/t_deque:.0f}x faster")

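5.3 str.join() vs + concatenation

import timeit

N = 10000

# === 1. str.join vs repeated concatenation ===
t_join = timeit.timeit(
    "''.join(str(i) for i in range(N))",
    setup=f'N = {N}',
    number=100
)

# A plain loop measures the concatenation itself, without building a throwaway list
t_plus = timeit.timeit(
    "s = ''\nfor i in range(N):\n    s += str(i)",
    setup=f'N = {N}',
    number=100
)

print(f"=== str.join vs + concatenation (n={N}) ===")
print(f"join: {t_join:.4f}s")
print(f"+:    {t_plus:.4f}s")
print(f"+ / join ratio: {t_plus/t_join:.1f}x")
# In principle:
# + creates a new string every time, O(n²) total in the worst case
# join allocates once, O(n) total
# In practice CPython can often extend the left operand in place when nothing else
# references it, so the measured gap may be small. That is an implementation detail; do not rely on it.

# === 2. How the gap changes with size ===
for n in [100, 1000, 5000, 10000]:
    t_j = timeit.timeit(
        "''.join(str(i) for i in range(N))",
        setup=f'N = {n}', number=100
    )
    t_p = timeit.timeit(
        "s = ''\nfor i in range(N):\n    s += str(i)",
        setup=f'N = {n}', number=100
    )
    print(f"n={n:5d}: join={t_j:.4f}s  +={t_p:.4f}s  +/join={t_p/t_j:.1f}x")
# Where the in-place optimization does not apply, + falls further behind as n grows (O(n²) vs O(n))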

5.4 dict.get() vs try/except

import timeit

N = 100000

# === dict.get vs try/except (key present) ===
setup = f"""
d = {{i: i*2 for i in range({N})}}
key = {N - 1}  # always present
"""

t_get = timeit.timeit('d.get(key)', setup=setup, number=1_000_000)
t_try = timeit.timeit(
    'try:\n    d[key]\nexcept KeyError:\n    pass',
    setup=setup, number=1_000_000
)

print(f"=== dict.get vs try/except (key present) ===")
print(f"get():       {t_get:.4f}s")
print(f"try/except:  {t_try:.4f}s")
print(f"get() / try ratio: {t_get/t_try:.2f}x")
# Key present: try/except is usually at least as fast, because d[key] avoids a
# method call and entering a try block costs almost nothing when no exception is raised

# === Key missing ===
setup_miss = f"""
d = {{i: i*2 for i in range({N})}}
key = -1  # never present
default = 0
"""

t_get_miss = timeit.timeit('d.get(key, default)', setup=setup_miss, number=1_000_000)
t_try_miss = timeit.timeit(
    'try:\n    v = d[key]\nexcept KeyError:\n    v = default',
    setup=setup_miss, number=1_000_000
)

print(f"\n=== dict.get vs try/except (key missing) ===")
print(f"get(key, default):  {t_get_miss:.4f}s")
print(f"try/except:         {t_try_miss:.4f}s")

# EAFP (Easier to Ask Forgiveness than Permission):
# Key usually present → try/except (the exception cost is paid only on a miss)
# Key usually missing → get() (avoids raising and catching exceptions)
# Python style: prefer EAFP unless the miss rate is high

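5.5 Local vs global variable lookup speed

import timeit

# === Local vs global variable lookup ===
t_local = timeit.timeit(
    'x',  # LOAD_FAST: indexed access into the local variable array
    setup='x = 42',
    number=10_000_000
)

t_global = timeit.timeit(
    'g',  # LOAD_GLOBAL: hash lookup in the globals dict
    globals={'g': 42},  # timeit runs in its own namespace, so g has to be passed in explicitly
    number=10_000_000
)

print(f"=== local vs global lookup ===")
print(f"Local variable (LOAD_FAST):    {t_local:.4f}s")
print(f"Global variable (LOAD_GLOBAL): {t_global:.4f}s")
print(f"Local is {t_global/t_local:.1f}x faster")
# LOAD_FAST is an array index, LOAD_GLOBAL is a dict lookup
# Expect a modest gap; 3.11+ caches global lookups, which narrows it further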

# === Optimization: cache a global function in a local inside a hot loop ===
import math

def slow_version(n):
    total = 0
    for i in range(n):
        total += math.sqrt(i)  # LOAD_GLOBAL math, then LOAD_ATTR sqrt, on every iteration
    return total

def fast_version(n):
    sqrt = math.sqrt  # cache as a local variable
    total = 0
    for i in range(n):
        total += sqrt(i)  # LOAD_FAST, faster
    return total

N = 100000
t_slow = timeit.timeit(lambda: slow_version(N), number=10)
t_fast = timeit.timeit(lambda: fast_version(N), number=10)

print(f"\n=== global function vs local cache ===")
print(f"Global lookup: {t_slow:.4f}s")
print(f"Local cache:   {t_fast:.4f}s")
print(f"Local cache is {t_slow/t_fast:.2f}x faster")

Key differences

Python operation	Python complexity	Java counterpart	Java complexity
list.append	O(1) amortized	ArrayList.add	O(1) amortized
list.insert(0, x)	O(n)	ArrayList.add(0, x)	O(n)
list[i]	O(1)	ArrayList.get	O(1)
list.pop()	O(1)	ArrayList.remove(size() - 1)	O(1)
list.pop(0)	O(n)	ArrayList.remove(0)	O(n)
dict[key]	O(1) average	HashMap.get	O(1) average
dict keeps insertion order (3.7+)	O(1) per insert	LinkedHashMap	O(1) per insert
set.add	O(1) average	HashSet.add	O(1) average
x in list	O(n)	ArrayList.contains	O(n)
x in set	O(1) average	HashSet.contains	O(1) average
str.join	O(n)	String.join	O(n)
str + str	O(n) per concatenation	String + String	O(n) per concatenation

常见陷阱

# 陷阱 1: 用 list 做成员检查
names = ["Alice", "Bob", "Charlie"]  # 假设有 10000 个
if "David" in names:  # O(n) — 线性搜索！
    pass

# 正确: 用 set
name_set = {"Alice", "Bob", "Charlie"}
if "David" in name_set:  # O(1) — 哈希查找！
    pass

# 陷阱 2: 频繁在 list 头部插入
items = []
for i in range(1000):
    items.insert(0, i)  # O(n) 每次插入！总 O(n²)

# 正确: 尾部插入后反转
items = []
for i in range(1000):
    items.append(i)  # O(1)
items.reverse()  # O(n) 总共 O(n)

# 或用 collections.deque
from collections import deque
items = deque()
for i in range(1000):
    items.appendleft(i)  # O(1)

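# Pitfall 3: concatenating strings with + in a loop
strings = ["a", "b", "c"]  # sample input
result = ""
for s in strings:
    result += s  # O(n²) in the worst case: each step may copy the whole string

# Right: use join
result = "".join(strings)  # O(n)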

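When to use

Membership tests: set ≈ dict, both far ahead of list
Frequent insertion at the front: deque > list
String concatenation: join > + (when concatenating many pieces)
Caching a global function in a local variable: inside hot loops
EAFP vs LBYL: use try/except when the key is usually present, get() otherwise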
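
The difference between an O(n) scan and an O(1) hash lookup can be shown without timing noise by counting equality comparisons. The sketch below is illustrative only; `CountingKey` and `count_eq` are hypothetical names introduced here.

```python
class CountingKey:
    """Hashable wrapper that counts how often __eq__ runs."""
    eq_calls = 0

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return hash(self.value)

    def __eq__(self, other):
        CountingKey.eq_calls += 1
        return isinstance(other, CountingKey) and self.value == other.value

def count_eq(container, target):
    """Return (found, number of __eq__ calls) for `target in container`."""
    CountingKey.eq_calls = 0
    found = target in container
    return found, CountingKey.eq_calls

n = 1000
keys = [CountingKey(i) for i in range(n)]
as_list, as_set = keys, set(keys)
target = CountingKey(n - 1)  # equal to the last element, but a distinct object

print("list:", count_eq(as_list, target))
print("set: ", count_eq(as_set, target))
```

The list has to compare the target against every element up to the match, while the set jumps to the matching hash bucket and compares once. Doubling `n` doubles the list's count and leaves the set's unchanged.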

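11.6 __slots__: Memory and Speed

Java/Kotlin comparison

// Java: object memory layout is decided by the JVM
// Ordinary object: header (12-16 bytes) + fields + alignment padding
// There is nothing like __slots__ for controlling fields precisely
// But the JVM's escape analysis can do scalar replacement:
// if an object does not escape the method, the JVM may break it into scalars and skip the heap allocation

// Kotlin: a data class auto-generates equals/hashCode/toString/copy
// but the memory layout is still decided by the JVM
data class Point(val x: Int, val y: Int)
// On 64-bit HotSpot with compressed class pointers this is typically 24 bytes
// (12-byte header + two 4-byte ints, padded to a multiple of 8)

Python implementation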

import sys
import tracemalloc
import timeit

# === 1. 基本对比 ===
class Point:
    """普通 Python 对象: 每个实例有一个 __dict__"""
    def __init__(self, x, y):
        self.x = x
        self.y = y

class SlotPoint:
    """使用 __slots__: 禁止 __dict__，固定属性"""
    __slots__ = ('x', 'y')
    def __init__(self, x, y):
        self.x = x
        self.y = y

p1 = Point(1, 2)
p2 = SlotPoint(1, 2)

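print(f"=== Size of a single object ===")
print(f"Regular object: {sys.getsizeof(p1)} bytes")
print(f"slots object:   {sys.getsizeof(p2)} bytes")
# Illustrative output (exact sizes vary by Python version):
# Regular object: 56 bytes
# slots object:   48 bytes
# Note: sys.getsizeof measures only the object itself and excludes the separate __dict__,
# so the real gap is larger than these two numbers suggest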

# === 2. 大量对象的内存对比 (tracemalloc) ===
N = 100000

# 测量普通对象
tracemalloc.start()
points1 = [Point(i, i) for i in range(N)]
current1, peak1 = tracemalloc.get_traced_memory()
tracemalloc.stop()

# 测量 slots 对象
tracemalloc.start()
points2 = [SlotPoint(i, i) for i in range(N)]
current2, peak2 = tracemalloc.get_traced_memory()
tracemalloc.stop()

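print(f"\n=== Memory used by {N} objects ===")
print(f"Regular objects: {peak1 / 1024 / 1024:.2f} MB")
print(f"slots objects:   {peak2 / 1024 / 1024:.2f} MB")
print(f"Saved:           {(1 - peak2/peak1) * 100:.1f}%")
# Often 40-60% saved; newer CPython versions store instance dicts more compactly, so measure on yours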

# === 3. 属性访问速度对比 ===
setup_normal = f"""
from __main__ import Point
p = Point(3, 4)
"""

setup_slots = f"""
from __main__ import SlotPoint
p = SlotPoint(3, 4)
"""

t_normal = timeit.timeit('p.x', setup=setup_normal, number=10_000_000)
t_slots = timeit.timeit('p.x', setup=setup_slots, number=10_000_000)

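print(f"\n=== Attribute read speed ===")
print(f"Regular object (__dict__ lookup): {t_normal:.4f}s")
print(f"slots object (descriptor access): {t_slots:.4f}s")
print(f"slots is {t_normal/t_slots:.2f}x faster")
# slots reads are slightly faster at best (descriptor vs dict lookup);
# on 3.11+ both paths are specialized and the difference is often within noise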

# === 4. 属性写入速度对比 ===
t_write_normal = timeit.timeit('p.x = 42', setup=setup_normal, number=10_000_000)
t_write_slots = timeit.timeit('p.x = 42', setup=setup_slots, number=10_000_000)

print(f"\n=== 属性写入速度 ===")
print(f"普通对象: {t_write_normal:.4f}s")
print(f"slots对象: {t_write_slots:.4f}s")
print(f"slots 快 {t_write_normal/t_write_slots:.2f}x")

# === 5. __slots__ 的继承行为 ===
class Base:
    __slots__ = ('x',)

class Child(Base):
    __slots__ = ('y',)  # 子类必须重新声明自己的 slots

c = Child()
c.x = 1  # OK — 继承自 Base
c.y = 2  # OK — 声明在 Child
# c.z = 3  # AttributeError! — 没有 __dict__，不能添加动态属性

# 如果子类不声明 __slots__，会自动获得 __dict__
class GrandChild(Child):
    pass  # 没有 __slots__

gc = GrandChild()
gc.x = 1
gc.y = 2
gc.z = 3  # OK! GrandChild 有 __dict__
# 但这样 GrandChild 就失去了 slots 的内存优势

# 正确做法: 每一层都声明 __slots__
class GrandChild2(Child):
    __slots__ = ('z',)

gc2 = GrandChild2()
gc2.x = 1
gc2.y = 2
gc2.z = 3
# gc2.w = 4  # AttributeError!

# === 6. __slots__ 与动态属性的冲突 ===
class User:
    __slots__ = ('name', 'age')

u = User()
u.name = "Alice"  # OK
u.age = 30        # OK
try:
    u.email = "a@b.com"  # AttributeError!
except AttributeError as e:
    print(f"动态属性报错: {e}")

# 如果需要动态属性，可以在 __slots__ 中加 __dict__
class FlexibleUser:
    __slots__ = ('name', 'age', '__dict__')  # 允许动态属性

fu = FlexibleUser()
fu.name = "Bob"
fu.email = "b@c.com"  # OK，存入 __dict__
# 但这样内存优势就没了

# === 7. __slots__ 与 pickle ===
import pickle

class Data:
    __slots__ = ('x', 'y')
    def __init__(self, x, y):
        self.x = x
        self.y = y

d = Data(1, 2)
serialized = pickle.dumps(d)
restored = pickle.loads(serialized)
print(f"\npickle 序列化/反序列化: x={restored.x}, y={restored.y}")
# 默认情况下 __slots__ 对象可以被 pickle
# 但如果 __slots__ 中有不可序列化的类型，需要自定义 __getstate__/__setstate__

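Key differences

Dimension	Regular object	__slots__ object
__dict__	Yes (dynamic attributes)	No
Memory	Larger	Smaller (often 40-60%, varies by Python version)
Attribute access	Slightly slower (dict lookup)	Slightly faster (descriptor)
Dynamic attributes	Supported	Not supported
Serialization	pickle works	Works with default protocols; custom state needs care
weakref	Supported automatically	Needs an explicit __weakref__ slot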

常见陷阱

# 陷阱 1: 忘记在子类中声明 __slots__
class Parent:
    __slots__ = ('x',)

class Child(Parent):
    pass  # 没有 __slots__ → 自动获得 __dict__ → 失去内存优势

# 正确: 子类也声明 __slots__
class Child(Parent):
    __slots__ = ('y',)

# 陷阱 2: __slots__ 对象默认不支持弱引用
import weakref
class Node:
    __slots__ = ('value',)

n = Node()
n.value = 42
try:
    weakref.ref(n)  # TypeError!
except TypeError as e:
    print(f"弱引用报错: {e}")

# 正确: 在 __slots__ 中加 __weakref__
class WeakNode:
    __slots__ = ('value', '__weakref__')

wn = WeakNode()
wn.value = 42
wr = weakref.ref(wn)  # OK

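When to use

Creating a very large number of small objects (100k+): __slots__ is strongly recommended
Data model / DTO classes: recommended
Classes that never need dynamic attributes: recommended
Need dynamic attributes: skip __slots__ (or add '__dict__' to it)
Need weak references: add '__weakref__' to __slots__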

11.7 缓存策略

Java/Kotlin 对比

// Java: Caffeine, Guava Cache, Spring @Cacheable
// 功能丰富: 过期策略、大小限制、统计、监听器

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.util.concurrent.TimeUnit;

// Caffeine Cache
Cache<String, String> cache = Caffeine.newBuilder()
    .maximumSize(10_000)                    // 最大条目数
    .expireAfterWrite(10, TimeUnit.MINUTES) // 写入后 10 分钟过期
    .expireAfterAccess(5, TimeUnit.MINUTES) // 访问后 5 分钟过期
    .recordStats()                          // 记录统计
    .build();

cache.put("key", "value");
String value = cache.getIfPresent("key");  // null if miss
String value2 = cache.get("key", k -> expensiveCompute(k));  // 计算式加载

// 统计信息
System.out.println(cache.stats());
// hitRate, missRate, evictionCount, etc.

// Kotlin: 可以用同样的 Java 缓存库
// 或用 kotlinx.coroutines 的缓存模式
// 自定义简易缓存:
class LruCache<K, V>(private val maxSize: Int) {
    private val cache = LinkedHashMap<K, V>(maxSize, 0.75f, true)
    fun get(key: K): V? = cache[key]
    fun put(key: K, value: V) {
        cache[key] = value
        if (cache.size > maxSize) {
            cache.remove(cache.keys.first())
        }
    }
}

Python 实现

@lru_cache 的函数式编程视角（偏函数、单分派）详见 5.5 functools，本节聚焦缓存性能优化。

from functools import lru_cache, cache, cached_property
import time
import timeit

# === 1. @lru_cache: LRU 缓存 ===
@lru_cache(maxsize=128)
def fibonacci(n):
    """经典: 无缓存 O(2^n)，有缓存 O(n)"""
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

print(f"fibonacci(100) = {fibonacci(100)}")
print(f"缓存信息: {fibonacci.cache_info()}")
# CacheInfo(hits=98, misses=101, maxsize=128, currsize=101)

# 清除缓存
fibonacci.cache_clear()
print(f"清除后: {fibonacci.cache_info()}")

# === 2. @lru_cache 的 typed 参数 ===
@lru_cache(maxsize=128, typed=True)
def add(a, b):
    """typed=True: 1+2 和 1.0+2.0 视为不同调用"""
    return a + b

add(1, 2)      # 缓存条目 1
add(1.0, 2.0)  # 缓存条目 2 (typed=True 时视为不同)
print(f"typed=True 缓存: {add.cache_info()}")

# === 3. @cache (3.9+): 无限大小缓存 ===
@cache
def expensive_computation(x):
    """无大小限制的缓存 — 注意内存泄漏风险"""
    time.sleep(0.001)  # 模拟耗时操作
    return x * x

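# First call for a given argument: ~0.001s (miss); later calls: close to 0s (hit)
# This timing covers 1 miss followed by 999 hits
t1 = timeit.timeit(lambda: expensive_computation(42), number=1000)
# Time a single cold call separately to see the cost of a miss
t2 = timeit.timeit(lambda: expensive_computation(43), number=1)
print(f"\n1 miss + 999 hits (42): {t1:.4f}s")
print(f"single miss (43):       {t2:.4f}s")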

# === 4. @cached_property (3.8+): 属性缓存 ===
class DataProcessor:
    def __init__(self, data):
        self.data = data

    @cached_property
    def processed(self):
        """只在第一次访问时计算，之后返回缓存值"""
        print("  计算中...")  # 只会打印一次
        return [x * 2 for x in self.data]

    @cached_property
    def summary(self):
        """依赖其他 cached_property"""
        print("  计算摘要...")
        return sum(self.processed)

dp = DataProcessor([1, 2, 3])
print("第一次访问 processed:")
print(dp.processed)  # 计算中... → [2, 4, 6]
print("第二次访问 processed:")
print(dp.processed)  # [2, 4, 6]（不再计算）

print("访问 summary:")
print(dp.summary)    # 计算摘要... → 12

# 清除缓存
del dp.processed
print("清除缓存后再次访问:")
print(dp.processed)  # 计算中... → [2, 4, 6]

# === 5. 缓存性能 benchmark ===
def no_cache_fib(n):
    if n < 2:
        return n
    return no_cache_fib(n - 1) + no_cache_fib(n - 2)

# The uncached version is only practical up to about n=30
t_no_cache = timeit.timeit(lambda: no_cache_fib(30), number=1)
# The cache was cleared above, so this times a cold fill of fibonacci(0..30)
t_with_cache = timeit.timeit(lambda: fibonacci(30), number=1)

print(f"\n=== 缓存性能对比 (fib(30)) ===")
print(f"无缓存: {t_no_cache:.4f}s")
print(f"有缓存: {t_with_cache:.6f}s")
print(f"缓存快 {t_no_cache/t_with_cache:.0f}x")

# === 6. 缓存失效策略 ===
# Python 标准库的缓存没有 TTL（过期时间）
# 如果需要 TTL，用第三方库 cachetools

# cachetools example (third-party: pip install cachetools)
# from cachetools import cached, TTLCache
#
# TTL cache: entries expire after 5 minutes
# cache = TTLCache(maxsize=1000, ttl=300)
# @cached(cache)
# def get_user(user_id):
#     return db.query(user_id)

# === 7. A hand-rolled TTL cache ===
import time

class SimpleCache:
    """Minimal TTL cache (not thread-safe, no size limit)."""
    def __init__(self, ttl=60):
        self._cache = {}
        self._ttl = ttl

    def get(self, key, default=None):
        if key in self._cache:
            value, timestamp = self._cache[key]
            # monotonic() is not affected by system clock adjustments
            if time.monotonic() - timestamp < self._ttl:
                return value
            del self._cache[key]  # expired: evict
        return default

    def set(self, key, value):
        self._cache[key] = (value, time.monotonic())

    def invalidate(self, key):
        self._cache.pop(key, None)

    def clear(self):
        self._cache.clear()

# Named ttl_cache so it does not shadow functools.cache imported above
ttl_cache = SimpleCache(ttl=5)
ttl_cache.set("user:1", {"name": "Alice"})
print(ttl_cache.get("user:1"))  # {'name': 'Alice'}
time.sleep(6)
print(ttl_cache.get("user:1"))  # None (expired)

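Key differences

Feature	@lru_cache	@cache	@cached_property	Java Caffeine
Size limit	Yes	No	No	Yes
Expiration	None	None	None	TTL / TTI (expireAfterWrite / expireAfterAccess)
Thread safety	Yes (but the function may run more than once for the same key)	Same as @lru_cache	No locking since Python 3.12 (3.8-3.11 used a lock)	Yes
Statistics	cache_info()	cache_info()	None	stats()
LRU eviction	Yes	No	No	Size-based (Window TinyLFU)
Use case	Pure functions	Pure functions	Computed attributes	General-purpose cache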

常见陷阱

# 陷阱 1: 缓存可变参数
@lru_cache(maxsize=128)
def process(items):
    return sum(items)

try:
    process([1, 2, 3])  # TypeError: unhashable type: 'list'!
except TypeError as e:
    print(f"不可哈希参数: {e}")

# 正确: 用元组
@lru_cache(maxsize=128)
def process(items):
    return sum(items)

print(process((1, 2, 3)))  # OK

# 陷阱 2: @cache 导致内存泄漏
# @cache 无大小限制，缓存会无限增长
# 对无界输入必须用 @lru_cache(maxsize=...)
# 或手动管理缓存

# 陷阱 3: 缓存了可变对象
@lru_cache(maxsize=128)
def get_config():
    return {'debug': True, 'port': 8080}

config = get_config()
config['debug'] = False  # 修改了缓存中的对象！
config2 = get_config()
print(config2['debug'])  # False — 缓存被污染了！
# 正确: 返回不可变对象或副本

# 陷阱 4: 忘记清除缓存
# 如果函数行为可能变化（如读取文件），需要手动清除缓存
# func.cache_clear()
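
Pitfall 3 above says to return an immutable object or a copy. One way to do that with the standard library is sketched below, assuming a read-only view via `types.MappingProxyType` is acceptable for the caller; the `get_config` function mirrors the pitfall example.

```python
from functools import lru_cache
from types import MappingProxyType

@lru_cache(maxsize=1)
def get_config():
    # The proxy is a read-only view; the underlying dict is not reachable by callers
    return MappingProxyType({"debug": True, "port": 8080})

config = get_config()
try:
    config["debug"] = False  # mappingproxy rejects item assignment
    mutated = True
except TypeError:
    mutated = False

# Callers that need to modify settings work on their own copy
local_copy = dict(get_config())
local_copy["debug"] = False

print(mutated, get_config()["debug"], local_copy["debug"])  # False True False
```

Note that the proxy is shallow: nested mutable values inside the mapping could still be changed.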

何时使用

递归函数（如 fibonacci）: @lru_cache
耗时纯函数: @lru_cache 或 @cache
计算属性: @cached_property
需要 TTL 过期: 用 cachetools 第三方库
需要缓存失效通知: 用自定义缓存类
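
If a third-party dependency is not an option, a coarse TTL can be layered on `lru_cache` by adding a time bucket to the cache key. This is only a sketch: `ttl_lru_cache` and its `clock` parameter are hypothetical names, and entries expire at bucket boundaries, so an entry lives for at most `ttl_seconds` rather than exactly that long.

```python
import time
from functools import lru_cache, wraps

def ttl_lru_cache(ttl_seconds, maxsize=128, clock=time.monotonic):
    """Decorator: lru_cache whose entries expire at fixed time-bucket boundaries."""
    def decorator(func):
        @lru_cache(maxsize=maxsize)
        def cached(bucket, *args, **kwargs):
            # bucket is only part of the key; it is not passed to func
            return func(*args, **kwargs)

        @wraps(func)
        def wrapper(*args, **kwargs):
            bucket = int(clock() // ttl_seconds)
            return cached(bucket, *args, **kwargs)

        wrapper.cache_info = cached.cache_info
        wrapper.cache_clear = cached.cache_clear
        return wrapper
    return decorator

# Demonstration with a fake clock so the example does not need to sleep
fake_now = [0.0]
calls = []

@ttl_lru_cache(ttl_seconds=10, clock=lambda: fake_now[0])
def load_user(user_id):
    calls.append(user_id)
    return f"user-{user_id}"

load_user(1)
load_user(1)          # same bucket: served from the cache
fake_now[0] = 11.0    # move into the next 10-second bucket
load_user(1)          # recomputed
print(len(calls))     # 2
```

Stale buckets are not removed eagerly; they age out through the normal LRU eviction, which is why `maxsize` should stay bounded.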

11.8 字符串优化

Java/Kotlin 对比

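// Java: strings are immutable (String is a final class)
// The JVM keeps a string pool (intern pool) for literals
// Concatenation of compile-time constants is folded by javac;
// concatenation involving variables compiles to StringBuilder (Java 8 and earlier)
// or to an invokedynamic call to StringConcatFactory (Java 9+)

// Java string concatenation
String s = "Hello" + " " + "World";  // folded at compile time into the constant "Hello World"
String t = "Hello, " + name + "!";   // runtime concatenation, optimized by the compiler/JVM

// Java string formatting
String.format("Hello, %s! You are %d years old.", name, age);  // slower (parses the format string)
"Hello, " + name + "! You are " + age + " years old.";          // faster

// Kotlin: string templates
val s = "Hello, $name! You are $age years old."
// Compiles to StringBuilder.append calls, or to invokedynamic when targeting JVM 9+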

Python 实现

import timeit

# === 1. 字符串拼接方式对比 ===
N = 10000

# 方式 1: + 拼接
t_plus = timeit.timeit(
    "s = ''\nfor i in range(N): s += str(i)",
    setup=f'N = {N}',
    number=100
)

# 方式 2: str.join
t_join = timeit.timeit(
    "''.join(str(i) for i in range(N))",
    setup=f'N = {N}',
    number=100
)

# 方式 3: 列表收集 + join
t_list_join = timeit.timeit(
    "parts = [str(i) for i in range(N)]; ''.join(parts)",
    setup=f'N = {N}',
    number=100
)

# Approach 4: io.StringIO (the import lives in setup so it is not timed)
t_stringio = timeit.timeit(
    """
buf = io.StringIO()
for i in range(N):
    buf.write(str(i))
s = buf.getvalue()
""",
    setup=f'import io; N = {N}',
    number=100
)

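print(f"=== String concatenation (n={N}) ===")
print(f"+ concat:    {t_plus:.4f}s  (O(n²) worst case; CPython often grows s += x in place)")
print(f"join:        {t_join:.4f}s  (O(n))")
print(f"list+join:   {t_list_join:.4f}s  (O(n))")
print(f"StringIO:    {t_stringio:.4f}s  (O(n))")
# Takeaway: use join for bulk concatenation; for a handful of pieces, + is fine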

# === 2. 字符串格式化对比 ===
name = "World"
age = 42

# f-string (推荐)
t_fstring = timeit.timeit(
    "f'Hello, {name}! Age: {age}'",
    setup="name='World'; age=42",
    number=1_000_000
)

# % 格式化
t_percent = timeit.timeit(
    "'Hello, %s! Age: %d' % (name, age)",
    setup="name='World'; age=42",
    number=1_000_000
)

# str.format()
t_format = timeit.timeit(
    "'Hello, {}! Age: {}'.format(name, age)",
    setup="name='World'; age=42",
    number=1_000_000
)

# Template
from string import Template
t_template = timeit.timeit(
    "Template('Hello, $name! Age: $age').substitute(name=name, age=age)",
    setup="from string import Template; name='World'; age=42",
    number=1_000_000
)

print(f"\n=== 字符串格式化性能 ===")
print(f"f-string:  {t_fstring:.4f}s  (最快)")
print(f"% 格式化:  {t_percent:.4f}s")
print(f".format(): {t_format:.4f}s")
print(f"Template:  {t_template:.4f}s  (最慢)")
# 结论: f-string 最快且最可读，优先使用

# === 3. 复杂格式化对比 ===
t_fstring_complex = timeit.timeit(
    "f'{name:>10} | {age:05d} | {score:.2f}'",
    setup="name='Alice'; age=42; score=95.678",
    number=1_000_000
)

t_format_complex = timeit.timeit(
    "'{name:>10} | {age:05d} | {score:.2f}'.format(name=name, age=age, score=score)",
    setup="name='Alice'; age=42; score=95.678",
    number=1_000_000
)

print(f"\n=== 复杂格式化 ===")
print(f"f-string:  {t_fstring_complex:.4f}s")
print(f".format(): {t_format_complex:.4f}s")
print(f"f-string 快 {t_format_complex/t_fstring_complex:.2f}x")

# === 4. 字符串驻留 (intern) ===
import sys

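# CPython interns identifier-like string constants at compile time
a = "hello"
b = "hello"
print(f"literals: a is b = {a is b}")  # True in CPython (implementation detail)

# Strings built at runtime are not interned automatically.
# Note: "a" * 1000 with a literal count is constant-folded by the compiler,
# so a variable is used here to force a runtime computation.
n = 1000
c = "a" * n
d = "a" * n
print(f"runtime strings: c is d = {c is d}")  # False: two distinct objects

# Manual interning
e = sys.intern("a" * n)
f = sys.intern("a" * n)
print(f"interned: e is f = {e is f}")  # True

# Benefit: interned strings can be matched by identity, which speeds up dict key lookups
# Caveat: `is` compares identity, not value
# Keep using == in application code; use sys.intern only for many repeated strings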

# === 5. bytes vs str 性能 ===
import timeit

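# bytes operations are often faster than str (no Unicode handling)
# Caveat: this compares chr() + join against the bytes constructor, so it measures two construction paths, not str vs bytes in general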
t_str = timeit.timeit(
    "''.join(chr(i % 128) for i in range(N))",
    setup='N = 10000',
    number=1000
)

t_bytes = timeit.timeit(
    "bytes(i % 128 for i in range(N))",
    setup='N = 10000',
    number=1000
)

print(f"\n=== bytes vs str 性能 ===")
print(f"str:   {t_str:.4f}s")
print(f"bytes: {t_bytes:.4f}s")
print(f"bytes 快 {t_str/t_bytes:.2f}x")

# === 6. 字符串查找性能 ===
text = "a" * 1000000 + "needle" + "a" * 1000000

t_find = timeit.timeit(
    "text.find('needle')",
    setup="text = 'a' * 1000000 + 'needle' + 'a' * 1000000",
    number=1000
)

t_index = timeit.timeit(
    "text.index('needle')",
    setup="text = 'a' * 1000000 + 'needle' + 'a' * 1000000",
    number=1000
)

t_in = timeit.timeit(
    "'needle' in text",
    setup="text = 'a' * 1000000 + 'needle' + 'a' * 1000000",
    number=1000
)

print(f"\n=== 字符串查找 ===")
print(f"find():  {t_find:.4f}s")
print(f"index(): {t_index:.4f}s")
print(f"in:      {t_in:.4f}s")
# find() 和 in 速度相近，index() 找不到时抛异常
# find() 返回 -1，in 返回 bool，更 Pythonic

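Key differences

Operation	Python	Java	Notes
A few concatenations	+ or f-string	+ (compiler-optimized)	Both are fast
Bulk concatenation	join	StringBuilder	CPython's in-place += optimization is not guaranteed
Formatting	f-string	String.format	f-strings are faster
String interning	sys.intern	String.intern	Similar mechanism
Immutability	Immutable	Immutable	Same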

常见陷阱

# Pitfall 1: concatenating many strings with + inside a loop
result = ""
for chunk in large_list:
    result += chunk  # may copy the whole string each time: O(n²) in the worst case

# Better: use join
result = "".join(large_list)  # O(n)

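# Pitfall 2: comparing strings with `is` and relying on interning
# Interning is a CPython implementation detail; do not depend on it
# Compare values with ==, never with is
if name == "Alice":  # correct
    pass
if name is "Alice":  # wrong: identity test; emits a SyntaxWarning since Python 3.8
    pass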

# 陷阱 3: 忽略编码开销
# str ↔ bytes 转换有开销
# 如果处理纯 ASCII，考虑用 bytes
data = b"hello world"  # bytes，无编码开销
text = "hello world"   # str，Unicode 编码

何时使用

少量拼接: + 或 f-string
大量拼接: str.join()
格式化: f-string（最快且最可读）
纯 ASCII 处理: 考虑 bytes
字符串比较: ==（不要用 is）
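
The "collect pieces, then join once" advice can look like the following in practice. This is an illustrative sketch only: `render_csv_row` is a hypothetical helper, and real CSV output is better handled by the standard `csv` module.

```python
def render_csv_row(values, sep=","):
    """Build one delimited row by collecting parts and joining once."""
    parts = []
    for value in values:
        text = str(value)
        # Quote fields that contain the separator or a double quote
        if sep in text or '"' in text:
            text = '"' + text.replace('"', '""') + '"'
        parts.append(text)
    return sep.join(parts)  # a single allocation for the final string

row = render_csv_row(["Alice", 30, "Hello, World"])
print(row)  # Alice,30,"Hello, World"
```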

11.9 加速方案对比

Java/Kotlin 对比

// Java/Kotlin: 性能已经很好（JIT 优化后接近 C/C++）
// 一般不需要额外的加速方案
// 如果需要更高性能:
// - JNI (Java Native Interface) 调用 C/C++
// - JNA (Java Native Access) 更简单的 FFI
// - Project Panama (Foreign Function & Memory API, Java 22+)
// - Kotlin/Native: 直接编译为原生代码
// - GraalVM Native Image: AOT 编译

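// JNI example (verbose: requires C code and a generated header)
public native int compute(int[] data);
// Generate the header with javac -h (javah was removed in JDK 10) → write the C implementation → build a .so/.dll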

// Kotlin/Native: 直接编译为原生代码
// fun compute(data: IntArray): Int { ... }
// 编译: kotlinc-native -o program program.kt
// 生成原生二进制，无 JVM 开销

// Kotlin 多平台: 共享代码，各平台分别编译
// expect fun compute(data: IntArray): Int
// actual fun compute(data: IntArray): Int = nativeCompute(data)

Python 实现

9.1 numpy 向量化

import timeit
import random

# === 纯 Python 循环 vs numpy 向量化 ===

# 纯 Python 版本
def python_sum(n):
    total = 0.0
    for i in range(n):
        total += i * i + i * 0.5
    return total

# numpy 向量化版本
import numpy as np

def numpy_sum(n):
    arr = np.arange(n, dtype=np.float64)
    return np.sum(arr * arr + arr * 0.5)

N = 1_000_000

t_python = timeit.timeit(lambda: python_sum(N), number=10)
t_numpy = timeit.timeit(lambda: numpy_sum(N), number=10)

print(f"=== 纯 Python vs numpy 向量化 (n={N}) ===")
print(f"纯 Python: {t_python:.4f}s")
print(f"numpy:     {t_numpy:.4f}s")
print(f"numpy 快 {t_python/t_numpy:.0f}x")
# 典型: numpy 快 50-100x

# === 更复杂的例子: 矩阵运算 ===
def python_matrix_multiply(n):
    """纯 Python 矩阵乘法 — 极慢"""
    A = [[random.random() for _ in range(n)] for _ in range(n)]
    B = [[random.random() for _ in range(n)] for _ in range(n)]
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def numpy_matrix_multiply(n):
    """numpy 矩阵乘法 — 极快"""
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)
    return A @ B  # 或 np.dot(A, B)

# 100x100 矩阵
t_py_mat = timeit.timeit(lambda: python_matrix_multiply(100), number=1)
t_np_mat = timeit.timeit(lambda: numpy_matrix_multiply(100), number=100)

print(f"\n=== 矩阵乘法 100x100 ===")
print(f"纯 Python (1次): {t_py_mat:.4f}s")
print(f"numpy (100次):   {t_np_mat:.4f}s")
print(f"numpy 每次:      {t_np_mat/100:.6f}s")
print(f"numpy 快 {t_py_mat/(t_np_mat/100):.0f}x")
# 典型: numpy 快 100-500x

# === numpy 广播机制 ===
def python_normalize(data):
    """纯 Python: 归一化"""
    mean = sum(data) / len(data)
    std = (sum((x - mean) ** 2 for x in data) / len(data)) ** 0.5
    return [(x - mean) / std for x in data]

def numpy_normalize(data):
    """numpy: 归一化"""
    arr = np.array(data)
    return (arr - arr.mean()) / arr.std()

data = [random.random() for _ in range(100000)]
t_py_norm = timeit.timeit(lambda: python_normalize(data), number=10)
t_np_norm = timeit.timeit(lambda: numpy_normalize(data), number=100)

print(f"\n=== 归一化 (n=100000) ===")
print(f"纯 Python (10次):  {t_py_norm:.4f}s")
print(f"numpy (100次):     {t_np_norm:.4f}s")
print(f"numpy 快 {t_py_norm/10/(t_np_norm/100):.0f}x")

9.2 Numba: JIT 编译数值计算

# 安装: pip install numba
# Numba: 用 @jit 装饰器将 Python 函数编译为机器码
# 适用于: 数值计算、循环密集型代码

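from numba import jit
import numpy as np  # used by the parallel example below
import random       # must be imported before the functions are first called
import timeit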

# === 纯 Python 版本 ===
def monte_carlo_pi_python(n):
    """蒙特卡洛方法计算 pi"""
    count = 0
    for i in range(n):
        x = random.random()
        y = random.random()
        if x * x + y * y <= 1.0:
            count += 1
    return 4.0 * count / n

# === Numba JIT 版本 ===
@jit(nopython=True)  # nopython=True: 编译为纯机器码，不依赖 Python
def monte_carlo_pi_numba(n):
    """蒙特卡洛方法计算 pi — Numba JIT 编译"""
    count = 0
    for i in range(n):
        x = random.random()
        y = random.random()
        if x * x + y * y <= 1.0:
            count += 1
    return 4.0 * count / n

import random

N = 1_000_000

# 第一次调用: 编译（慢）
_ = monte_carlo_pi_numba(1000)

# 后续调用: 编译后的机器码（快）
t_python = timeit.timeit(lambda: monte_carlo_pi_python(N), number=10)
t_numba = timeit.timeit(lambda: monte_carlo_pi_numba(N), number=10)

print(f"=== 蒙特卡洛 pi (n={N}) ===")
print(f"纯 Python: {t_python:.4f}s")
print(f"Numba:     {t_numba:.4f}s")
print(f"Numba 快 {t_python/t_numba:.0f}x")
# 典型: Numba 快 50-200x

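# === What Numba's nopython mode supports ===
# - Basic numeric types (int, float, complex)
# - NumPy arrays and many NumPy functions
# - Loops and conditional branches
# - Much of the math module and part of random
# Limited support: homogeneous lists, numba.typed.List / numba.typed.Dict, basic str operations
# Not supported: arbitrary Python objects and most third-party library calls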

# === Numba 并行 ===
from numba import prange

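@jit(nopython=True, parallel=True)
def parallel_sum(arr):
    """Parallel sum; Numba treats total as a reduction variable."""
    total = 0.0  # float accumulator to match the array dtype
    for i in prange(len(arr)):
        total += arr[i]
    return total

arr = np.random.rand(10_000_000)
parallel_sum(arr[:10])  # warm-up call so compilation time is not measured
t_serial = timeit.timeit(lambda: np.sum(arr), number=10)
t_parallel = timeit.timeit(lambda: parallel_sum(arr), number=10)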

print(f"\n=== 并行求和 (n=10M) ===")
print(f"numpy 串行: {t_serial:.4f}s")
print(f"Numba 并行: {t_parallel:.4f}s")

9.3 Cython: Python → C 编译

# Cython: 将 Python 代码编译为 C 扩展
# 安装: pip install cython

# === 步骤 1: 创建 .pyx 文件 ===
# 文件: fast_compute.pyx
#
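# def python_sum(int n):
#     cdef long long i          # 64-bit: i * i overflows a 32-bit int for large n
#     cdef double total = 0.0
#     for i in range(n):
#         total += i * i + i * 0.5
#     return total
#
# Key point: cdef declares C-typed variables, which avoids Python object overhead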

# === 步骤 2: 创建 setup.py ===
# 文件: setup.py
#
# from setuptools import setup
# from Cython.Build import cythonize
#
# setup(
#     ext_modules = cythonize("fast_compute.pyx")
# )
#
# 编译: python setup.py build_ext --inplace
# 生成: fast_compute.cpython-310-x86_64-linux-gnu.so

# === 步骤 3: 使用编译后的模块 ===
# import fast_compute
# result = fast_compute.python_sum(1_000_000)

# === Cython 性能对比示意 ===
# 纯 Python: 0.15s
# Cython (无类型): 0.05s (3x)
# Cython (有类型): 0.002s (75x)
# C 原生: 0.001s (150x)

# Cython 的优势:
# 1. 渐进式优化: 先加类型注解，再改算法
# 2. 可以调用 C 库
# 3. 与 Python 无缝集成
# Cython 的劣势:
# 1. 需要额外的编译步骤
# 2. 需要学习 Cython 语法
# 3. 调试困难

9.4 C 扩展: ctypes/cffi

import ctypes
import timeit

# === ctypes: 调用 C 共享库 ===
# 最简单的 FFI 方式，标准库自带

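# Simple example: call abs() from libc
# (in real projects you usually load a .so/.dll you compiled yourself)
# CDLL(None) works on POSIX systems (Linux/macOS); on Windows load a named DLL such as ctypes.cdll.msvcrt
libc = ctypes.CDLL(None)  # handle to the symbols already loaded in this process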
abs_func = libc.abs
abs_func.argtypes = [ctypes.c_int]
abs_func.restype = ctypes.c_int

print(f"abs(-42) = {abs_func(-42)}")

# === cffi: 更现代的 FFI ===
# 安装: pip install cffi
# 支持直接嵌入 C 代码，自动编译

# from cffi import FFI
# ffi = FFI()
# ffi.cdef("""
#     int abs(int j);
# """)
# C = ffi.dlopen(None)  # 加载默认 C 库
# print(C.abs(-42))

# === ctypes vs cffi vs Cython ===
# ctypes: 最简单，但性能一般（每次调用有转换开销）
# cffi: 更灵活，支持嵌入 C 代码，性能好
# Cython: 最强性能，但需要编译步骤

# === vs JNI / Kotlin Native ===
# JNI: 类似 ctypes，但更复杂（需要编写 JNI 桥接代码）
# Kotlin Native: 直接编译为原生代码，无 FFI 开销
# Python FFI: 比 JNI 简单，但性能不如 Kotlin Native

9.5 PyPy: 替代解释器

# PyPy: 用 RPython 编写的 Python 解释器
# 内置 JIT 编译器，对纯 Python 代码提速显著
# 安装: 下载 PyPy 二进制包

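# PyPy characteristics:
# 1. Built-in tracing JIT: hot loops are compiled to machine code
# 2. Implements specific Python 3 versions, usually lagging CPython (check the PyPy release notes)
# 3. Pure-Python code often runs 3-10x faster (workload dependent)
# 4. C extensions run through the cpyext compatibility layer, which is slower and incomplete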

# PyPy vs CPython:
# - 纯 Python 计算: PyPy 快 3-10x
# - numpy/pandas: PyPy 可能更慢（C 扩展兼容性问题）
# - 启动时间: PyPy 更慢（JIT 预热）
# - 内存: PyPy 使用更多内存（JIT 代码缓存）

# PyPy 适用场景:
# - 纯 Python 计算密集型应用
# - 长时间运行的服务
# - 不重度依赖 C 扩展

# PyPy 不适用:
# - 短命脚本（JIT 预热来不及）
# - 重度依赖 numpy/pandas
# - 需要 CPython 特定 C 扩展

加速方案选择决策树

需要加速 Python 代码？
├── 能用内置函数/库？
│   └── 用 sum(), max(), sorted() 等 C 实现 → 通常快 10-100x
│
├── 数值计算/科学计算？
│   └── numpy/pandas 向量化 → 快 50-500x
│
├── 循环密集型数值计算？
│   ├── 简单: Numba @jit → 快 50-200x，改动最小
│   └── 复杂: Cython → 快 50-150x，需要编译
│
├── 需要调用 C/C++ 库？
│   ├── 简单调用: ctypes (标准库)
│   ├── 嵌入 C 代码: cffi
│   └── 深度集成: Cython
│
├── 纯 Python 逻辑密集？
│   └── PyPy 替代解释器 → 快 3-10x
│
├── CPU 密集型？
│   └── multiprocessing 多进程 → 利用多核
│
└── 不确定？
    └── 先 profile → 找到瓶颈 → 针对性优化
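
The last branch of the tree (profile first, then optimize the bottleneck) can be sketched with the standard `cProfile` and `pstats` modules. The functions `slow_part`, `fast_part`, `workload` and the helper `profile_top` are hypothetical names used only for illustration.

```python
import cProfile
import io
import pstats

def slow_part(n):
    """Deliberately slow: a Python-level loop."""
    return sum(i * i for i in range(n))

def fast_part(n):
    """Closed-form arithmetic, effectively free."""
    return n * (n - 1) // 2

def workload():
    return slow_part(200_000) + fast_part(200_000)

def profile_top(func, limit=10):
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()

report = profile_top(workload)
print(report)
```

Reading the report top-down shows which function dominates cumulative time, which is the one worth moving to a built-in, numpy, or a compiled path.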

核心差异

方案	提速	改动成本	适用场景	vs Java/Kotlin
内置函数	10-100x	极低	通用	类似 JIT 内联
numpy	50-500x	低	数值计算	类似 Stream API
Numba	50-200x	低	数值循环	类似 JIT
Cython	50-150x	中	性能关键	类似 JNI
ctypes/cffi	取决于 C 代码	中	调用 C 库	类似 JNI
PyPy	3-10x	极低	纯 Python	类似 JIT
multiprocessing	N 核	低	CPU 密集	类似 parallel stream

常见陷阱

# 陷阱 1: 过早引入 numpy
# 如果数据量小（<1000），numpy 的创建数组开销可能超过计算收益
import numpy as np
import timeit

# 小数据: 纯 Python 可能更快
small = list(range(100))
t_py = timeit.timeit('sum(x*x for x in data)', setup='data=list(range(100))', number=10000)
t_np = timeit.timeit('np.sum(np.array(data)**2)', setup='import numpy as np; data=list(range(100))', number=10000)
print(f"小数据: Python={t_py:.4f}s, numpy={t_np:.4f}s")
# numpy 创建数组有开销，小数据时可能更慢

# 陷阱 2: Numba 第一次调用包含编译时间
# 第一次调用很慢（编译），后续调用才快
# benchmark 时要排除第一次调用

# 陷阱 3: Cython 的 GIL
# 默认 Cython 代码持有 GIL
# 释放 GIL: with nogil:
# 需要确保 nogil 块内不调用 Python API

# 陷阱 4: multiprocessing 的序列化开销
# 进程间通信需要 pickle 序列化
# 传递大数据时有显著开销
# 解决: 用共享内存 (multiprocessing.shared_memory)
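
A minimal sketch of the shared-memory approach from pitfall 4, assuming Python 3.8+ where `multiprocessing.shared_memory` exists. To stay self-contained it attaches to the block from the same process; in real use a worker process would call the hypothetical `read_by_name` helper with the block name it received.

```python
from multiprocessing import shared_memory

def publish(payload: bytes) -> shared_memory.SharedMemory:
    """Create a shared block and copy payload into it (the caller must unlink it)."""
    block = shared_memory.SharedMemory(create=True, size=len(payload))
    block.buf[:len(payload)] = payload
    return block

def read_by_name(name: str, size: int) -> bytes:
    """Attach to an existing block by name and copy size bytes out of it."""
    view = shared_memory.SharedMemory(name=name)
    try:
        return bytes(view.buf[:size])
    finally:
        view.close()  # detach only; the owner is responsible for unlink()

payload = b"large array bytes would go here"
owner = publish(payload)
try:
    # Only the short name string needs to be pickled and sent to a worker
    received = read_by_name(owner.name, len(payload))
finally:
    owner.close()
    owner.unlink()  # free the block once every process is done with it

print(received == payload)  # True
```

The size is passed explicitly because some platforms round the block up to a page boundary.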

何时使用

优先用内置函数和标准库（C 实现）
数值计算: numpy/pandas
循环密集: Numba 或 Cython
调用 C 库: ctypes/cffi
纯 Python 加速: PyPy
多核并行: multiprocessing

11.10 内存管理: 引用计数 + 分代GC

Java/Kotlin 对比

// Java: 纯分代 GC
// 新生代 (Young Gen) → 存活对象晋升到老年代 (Old Gen)
// GC 算法: G1 (默认), ZGC, Shenandoah
// 对象不可达时由 GC 统一回收
// 开发者不需要（也不能）手动管理内存

// Java 引用类型:
// - Strong Reference: 默认，阻止 GC
// - SoftReference: 内存不足时才回收（适合缓存）
// - WeakReference: 每次 GC 都可能回收
// - PhantomReference: 用于资源清理

import java.lang.ref.WeakReference;
Object obj = new Object();
WeakReference<Object> ref = new WeakReference<>(obj);
obj = null;  // 对象可以被 GC 回收
System.gc();
Object recovered = ref.get();  // 可能返回 null

// Kotlin: same as Java (on the JVM)
// Reference cycles do not leak either: the tracing GC reclaims unreachable cycles

Python 实现

10.1 引用计数机制

import sys
import gc

# === 1. 引用计数基础 ===
a = [1, 2, 3]
print(f"创建后:   refcount = {sys.getrefcount(a)}")  # 2 (a + getrefcount参数)

b = a
print(f"赋值后:   refcount = {sys.getrefcount(a)}")  # 3 (a + b + getrefcount)

c = [a, a, a]
print(f"列表引用: refcount = {sys.getrefcount(a)}")  # 6 (a + b + c[0] + c[1] + c[2] + getrefcount)

del b
print(f"del b后:  refcount = {sys.getrefcount(a)}")  # 5

del c
print(f"del c后:  refcount = {sys.getrefcount(a)}")  # 2

# 引用计数归零 → 立即释放内存
# 这是 Python 的主要内存回收机制
# 优点: 即时回收，不需要等 GC 周期
# 缺点: 无法处理循环引用

# === 2. 引用计数 vs JVM GC ===
# Python: 引用计数归零 → 立即释放（确定性）
# Java: 对象不可达 → 等 GC 周期 → 非确定性
# Python 的确定性回收对资源管理很重要（文件、锁等）

10.2 循环引用问题

import gc
import weakref

# === 1. Reference cycles: reference counting alone cannot free them ===
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None

    def __del__(self):
        print(f"  Node {self.value} 被回收")

print("=== 循环引用演示 ===")

a = Node(1)
b = Node(2)
a.next = b
b.next = a  # 循环引用!

print("删除外部引用...")
del a
del b

# 此时 a 和 b 的引用计数都不为 0（互相引用）
# 但它们已经不可达了
# 引用计数无法回收！

print("手动触发 GC...")
collected = gc.collect()  # 回收循环引用
print(f"GC 回收了 {collected} 个对象")

# === 2. gc 模块控制 ===
print(f"\n=== GC 配置 ===")
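print(f"GC thresholds: {gc.get_threshold()}")
# (700, 10, 10) on CPython 3.12 and earlier; newer versions may report different defaults
# Meaning:
# Generation 0: collected when allocations minus deallocations of tracked objects exceed 700
# Generation 1: collected after 10 generation-0 collections
# Generation 2: collected after 10 generation-1 collections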

print(f"GC 计数: {gc.get_count()}")
print(f"GC 是否启用: {gc.isenabled()}")

# === 3. gc 模块调试 ===
gc.set_debug(gc.DEBUG_STATS)

# 查看被 GC 跟踪的对象
print(f"\n被跟踪的对象数: {len(gc.get_objects())}")

# 查看特定类型的对象
refs = [obj for obj in gc.get_objects() if isinstance(obj, list)]
print(f"list 对象数: {len(refs)}")

gc.set_debug(0)  # 关闭调试

10.3 弱引用

import weakref
import gc

# === 1. 弱引用基础 ===
class MyClass:
    def __init__(self, name):
        self.name = name

obj = MyClass("test")
ref = weakref.ref(obj)  # 创建弱引用

print(f"弱引用存在: {ref() is not None}")  # True
print(f"对象名: {ref().name}")  # test

del obj  # 删除唯一强引用
# 引用计数归零 → 对象被回收

print(f"删除后: {ref() is None}")  # True — 弱引用返回 None

# === 2. 弱引用打破循环引用 ===
class Parent:
    def __init__(self, name):
        self.name = name
        self._children = []  # list of weak references to children

    def add_child(self, child):
        self._children.append(weakref.ref(child))

    def get_children(self):
        return [ref() for ref in self._children if ref() is not None]

class Child:
    def __init__(self, name, parent):
        self.name = name
        self.parent = weakref.ref(parent)  # 弱引用父节点

parent = Parent("root")
child1 = Child("a", parent)
child2 = Child("b", parent)

parent.add_child(child1)
parent.add_child(child2)

print(f"\n子节点: {[c.name for c in parent.get_children()]}")

del child1, child2  # 子节点可以被回收（父节点只持有弱引用）
print(f"删除后子节点: {parent.get_children()}")

# === 3. WeakKeyDictionary / WeakValueDictionary ===
# WeakKeyDictionary: key 是弱引用，key 被回收时自动删除条目
# WeakValueDictionary: value 是弱引用，value 被回收时自动删除条目

cache = weakref.WeakValueDictionary()

class Data:
    def __init__(self, key):
        self.key = key

d1 = Data("key1")
cache["key1"] = d1
print(f"\n缓存大小: {len(cache)}")  # 1

del d1  # Data 对象被回收
print(f"删除后缓存大小: {len(cache)}")  # 0 — 自动清理

10.4 __del__: correct usage and pitfalls

import gc

# === 1. __del__ 基本用法 ===
class Resource:
    def __init__(self, name):
        self.name = name
        print(f"  {name}: 创建")

    def __del__(self):
        print(f"  {self.name}: 销毁")

print("=== __del__ 基本用法 ===")
r = Resource("r1")
del r  # 立即调用 __del__（引用计数归零）
# 输出: r1: 销毁

# === 2. __del__ 与循环引用 ===
class A:
    def __init__(self):
        self.b = None
        print("  A: 创建")

    def __del__(self):
        print("  A: 销毁")

class B:
    def __init__(self):
        self.a = None
        print("  B: 创建")

    def __del__(self):
        print("  B: 销毁")

print("\n=== __del__ 与循环引用 ===")
a = A()
b = B()
a.b = b
b.a = a

del a, b  # 不会立即调用 __del__（循环引用）
print("触发 GC...")
gc.collect()  # 现在才调用 __del__
# Python 3.4+ (PEP 442): 可以正确处理 __del__ 中的循环引用
# 旧版本: 可能导致对象不可回收

# === 3. __del__ 的正确替代方案 ===
# 推荐用上下文管理器 (with) 替代 __del__
class SafeResource:
    def __init__(self, name):
        self.name = name
        print(f"  {name}: 打开资源")

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        print(f"  {self.name}: 关闭资源")
        return False

    def do_work(self):
        print(f"  {self.name}: 工作中")

print("\n=== 上下文管理器 (推荐) ===")
with SafeResource("db_conn") as r:
    r.do_work()
# 自动调用 __exit__，确保资源释放
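
When a `with` block does not fit (for example, the object's lifetime is not lexically scoped), `weakref.finalize` from the standard library is a commonly suggested alternative to `__del__`. The sketch below uses a hypothetical `TempResource` class and a plain list as a stand-in for real cleanup work.

```python
import gc
import weakref

class TempResource:
    def __init__(self, name, log):
        self.name = name
        # The callback must not reference self, or the object could never be collected
        self._finalizer = weakref.finalize(self, log.append, f"closed {name}")

    def close(self):
        self._finalizer()  # runs the callback at most once

events = []

first = TempResource("first", events)
first.close()
first.close()   # already finalized: no second log entry

second = TempResource("second", events)
del second      # in CPython the refcount hits zero and the finalizer fires
gc.collect()    # only matters on implementations without reference counting

print(events)   # ['closed first', 'closed second']
```

Unlike `__del__`, the finalizer is also run at interpreter exit if it has not fired yet, and it can be triggered explicitly through `close()`.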

10.5 内存分析工具

import tracemalloc
import sys
import gc

# === 1. tracemalloc: 跟踪内存分配 ===
tracemalloc.start(25)  # 记录最多 25 帧的调用栈

# 分配一些内存
data = [list(range(1000)) for _ in range(1000)]

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')

print("=== 内存分配 Top 5 ===")
for stat in top_stats[:5]:
    print(stat)

# === 2. 比较两个快照找泄漏 ===
tracemalloc.stop()
tracemalloc.start(25)

snapshot1 = tracemalloc.take_snapshot()

# 可能泄漏的代码
leaky_list = []
for i in range(10000):
    leaky_list.append({'data': list(range(100))})

snapshot2 = tracemalloc.take_snapshot()

top_diffs = snapshot2.compare_to(snapshot1, 'lineno')
print("\n=== 内存增长 Top 5 ===")
for stat in top_diffs[:5]:
    print(stat)

tracemalloc.stop()

# === 3. 对象计数 ===
print(f"\n=== 当前对象统计 ===")
print(f"list 对象: {sum(1 for obj in gc.get_objects() if isinstance(obj, list))}")
print(f"dict 对象: {sum(1 for obj in gc.get_objects() if isinstance(obj, dict))}")
print(f"tuple 对象: {sum(1 for obj in gc.get_objects() if isinstance(obj, tuple))}")

# === 4. sys.getsizeof 详解 ===
print(f"\n=== 对象大小 ===")
print(f"空 list:    {sys.getsizeof([])} bytes")
print(f"空 dict:    {sys.getsizeof({})} bytes")
print(f"空 set:     {sys.getsizeof(set())} bytes")
print(f"空 tuple:   {sys.getsizeof(())} bytes")
print(f"空 str:     {sys.getsizeof('')} bytes")
print(f"int 0:      {sys.getsizeof(0)} bytes")
print(f"int 2**30:  {sys.getsizeof(2**30)} bytes")
print(f"float 1.0:  {sys.getsizeof(1.0)} bytes")
print(f"bool True:  {sys.getsizeof(True)} bytes")
print(f"None:       {sys.getsizeof(None)} bytes")

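# Note: sys.getsizeof measures only the object itself, not the objects it references
# e.g. sys.getsizeof([1, 2, 3]) covers the list header and its pointer array, not the ints 1, 2, 3
# For the full footprint, use pympler.asizeof or compute it recursively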
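
The recursive calculation mentioned above can be sketched as follows. `deep_getsizeof` is a hypothetical helper that only descends into the built-in containers dict, list, tuple, set and frozenset; attributes of custom objects are not followed, so treat the result as a lower bound.

```python
import sys

def deep_getsizeof(obj, seen=None):
    """Approximate total size of obj plus the container contents it references."""
    seen = set() if seen is None else seen
    if id(obj) in seen:
        return 0  # count shared objects once and stop on cycles
    seen.add(id(obj))

    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        for key, value in obj.items():
            size += deep_getsizeof(key, seen) + deep_getsizeof(value, seen)
    elif isinstance(obj, (list, tuple, set, frozenset)):
        for item in obj:
            size += deep_getsizeof(item, seen)
    return size

inner = list(range(1000, 1100))
outer = [inner, inner]  # the same list referenced twice

print(sys.getsizeof(outer), deep_getsizeof(outer))
```

Because visited ids are tracked, `inner` is counted once even though `outer` references it twice.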

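Key differences

Dimension	JVM GC	Python (CPython)
Primary mechanism	Generational tracing GC	Reference counting
Secondary mechanism	None	Generational cycle collector (for reference cycles)
Reclamation timing	At a GC cycle (non-deterministic)	Immediately when the count reaches zero
Stop-the-world pauses	Yes (during GC)	Yes (during cycle detection)
Manual trigger	System.gc() (a hint only)	gc.collect()
Reference cycles	Handled by the tracing GC	Need the cycle collector
Weak references	WeakReference	weakref.ref
Soft references	SoftReference	None (use a bounded cache such as functools.lru_cache or cachetools)
Resource cleanup	try-with-resources / AutoCloseable	with / __enter__ / __exit__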

常见陷阱

# 陷阱 1: __del__ 中访问外部资源可能失败
class Danger:
    def __init__(self):
        self.file = open('/tmp/test', 'w')

    def __del__(self):
        self.file.close()  # 如果解释器关闭时调用，可能失败

# 正确: 用上下文管理器
class Safe:
    def __enter__(self):
        self.file = open('/tmp/test', 'w')
        return self

    def __exit__(self, *args):
        self.file.close()

# 陷阱 2: 闭包持有大对象
def create_processor():
    big_data = list(range(1_000_000))  # 大列表
    def process(x):
        return x + big_data[0]  # 闭包引用 big_data
    return process

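f = create_processor()
# big_data stays alive as long as f does, because the closure cell references it
# Fix: copy only the needed value into a local before defining process, or del f when done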

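# Pitfall 3: the risk of gc.disable()
# gc.disable() turns off the automatic cycle collector (reference counting still works)
# Reference cycles are then never reclaimed unless you call gc.collect() yourself
# Use it only temporarily in special cases (e.g. latency-critical sections)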
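
For pitfall 2, the closure cells themselves show what a function keeps alive. The sketch below compares two hypothetical factories, `make_leaky` and `make_lean`, by inspecting `__closure__`.

```python
def make_leaky():
    big_data = list(range(100_000))
    def process(x):
        return x + big_data[0]  # captures the whole list
    return process

def make_lean():
    big_data = list(range(100_000))
    first = big_data[0]         # copy out only what is needed
    def process(x):
        return x + first        # captures a single int
    return process

def captured_types(func):
    """Return the type names of the objects held in func's closure cells."""
    return [type(cell.cell_contents).__name__ for cell in func.__closure__]

leaky = make_leaky()
lean = make_lean()
print(captured_types(leaky))  # ['list']
print(captured_types(lean))   # ['int']
```

In the lean version the list becomes unreachable as soon as the factory returns, so reference counting frees it immediately.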

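When to use

Understanding reference counting is the foundation for understanding Python's memory behavior
Reference cycles: use weakref, or make sure the cycle collector can run
Many temporary objects: consider an object pool or __slots__
Resource management: use the with statement; do not rely on __del__
Memory leak hunting: tracemalloc + gc.get_objects()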

11.11 并发性能选型

Java/Kotlin 对比

// Java: 真正的多线程（共享内存模型）
// - Thread: 底层 OS 线程
// - ExecutorService: 线程池
// - CompletableFuture: 异步编程
// - Virtual Threads (Java 21+): 轻量级线程（类似协程）

// Java 线程池
ExecutorService executor = Executors.newFixedThreadPool(4);
List<Future<Integer>> futures = new ArrayList<>();
for (int i = 0; i < 10; i++) {
    futures.add(executor.submit(() -> heavyComputation()));
}
for (Future<Integer> f : futures) {
    result += f.get();  // 等待结果
}

// Java: CPU 密集型用多线程，I/O 密集型也用多线程
// 因为 Java 没有真正的协程（Virtual Threads 之前）

// Kotlin: 协程 (coroutines)
// - 轻量级，比线程便宜得多
// - 挂起函数 (suspend function)
// - 结构化并发

import kotlinx.coroutines.*

// CPU 密集型
suspend fun cpuBound() = withContext(Dispatchers.Default) {
    // 默认并行度 = CPU 核心数
    heavyComputation()
}

// I/O 密集型
suspend fun ioBound() = withContext(Dispatchers.IO) {
    // 默认 64 个线程
    httpClient.get(url)
}

// Run in parallel
val results = coroutineScope {
    listOf(async { task1() }, async { task2() }).awaitAll()
}

Python 实现

11.1 CPU 密集型: multiprocessing vs threading vs asyncio

import time
import timeit
import multiprocessing
import threading
import asyncio

# === CPU 密集型任务 ===
def cpu_task(n):
    """CPU 密集型: 纯计算"""
    total = 0
    for i in range(n):
        total += i * i
    return total

N = 5_000_000

# === 1. 串行 ===
def serial():
    results = []
    for _ in range(4):
        results.append(cpu_task(N))
    return results

# === 2. threading (受 GIL 限制) ===
def threading_test():
    results = [None] * 4
    threads = []
    for i in range(4):
        t = threading.Thread(target=lambda idx=i: results.__setitem__(idx, cpu_task(N)))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    return results

# === 3. multiprocessing (绕过 GIL) ===
def multiprocessing_test():
    with multiprocessing.Pool(4) as pool:
        results = pool.map(cpu_task, [N] * 4)
    return results

# === 4. asyncio (CPU 密集型不适用，但演示) ===
async def async_cpu_task(n):
    """asyncio 不适合 CPU 密集型（会阻塞事件循环）"""
    return cpu_task(n)

async def asyncio_test():
    tasks = [async_cpu_task(N) for _ in range(4)]
    return await asyncio.gather(*tasks)

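# === benchmark ===
# The __main__ guard is needed for multiprocessing on platforms that start
# workers with spawn (Windows, macOS), because workers re-import this module.
if __name__ == "__main__":
    print("=== CPU-bound benchmark ===")
    print("serial:")
    t = timeit.timeit(serial, number=3)
    print(f"  elapsed: {t:.4f}s")

    print("threading (4 threads):")
    t = timeit.timeit(threading_test, number=3)
    print(f"  elapsed: {t:.4f}s")

    print("multiprocessing (4 processes):")
    t = timeit.timeit(multiprocessing_test, number=3)
    print(f"  elapsed: {t:.4f}s")

    print("asyncio (4 coroutines):")
    t = timeit.timeit(lambda: asyncio.run(asyncio_test()), number=3)
    print(f"  elapsed: {t:.4f}s")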

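# Typical results (4-core machine):
# serial:           ~6.0s  (baseline)
# threading:        ~6.2s  (limited by the GIL, can even be slower)
# multiprocessing:  ~1.8s  (about 3.3x; process startup and pickling keep it below 4x)
# asyncio:          ~6.0s  (single thread, no speedup)
#
# Takeaway: on a standard GIL-enabled CPython build, CPU-bound work needs multiprocessing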

11.2 I/O 密集型: threading vs asyncio

import time
import threading
import asyncio
import urllib.request

# === I/O 密集型任务 ===
def io_task(url):
    """模拟 I/O 操作（网络请求）"""
    try:
        resp = urllib.request.urlopen(url, timeout=5)
        return len(resp.read())
    except Exception:
        return 0

# 用 sleep 模拟 I/O（更可控）
def io_task_simulated(delay):
    """模拟 I/O 延迟"""
    time.sleep(delay)
    return delay

async def async_io_task_simulated(delay):
    """asyncio 版本"""
    await asyncio.sleep(delay)
    return delay

URLS = ['https://httpbin.org/delay/0.1'] * 8

# === 1. 串行 ===
def serial_io():
    results = []
    for url in URLS:
        results.append(io_task(url))
    return results

# === 2. threading ===
def threading_io():
    results = [None] * len(URLS)
    threads = []
    for i, url in enumerate(URLS):
        t = threading.Thread(target=lambda u=url, idx=i: results.__setitem__(idx, io_task(u)))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    return results

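# === 3. asyncio (blocking calls offloaded to the default thread pool) ===
async def asyncio_io():
    loop = asyncio.get_running_loop()  # preferred over get_event_loop() inside coroutines
    tasks = [loop.run_in_executor(None, io_task, url) for url in URLS]
    return await asyncio.gather(*tasks)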

# === 4. asyncio 原生 (aiohttp) ===
# 需要安装: pip install aiohttp
# async def async_io_native():
#     import aiohttp
#     async with aiohttp.ClientSession() as session:
#         tasks = [session.get(url) for url in URLS]
#         responses = await asyncio.gather(*tasks)
#         return [len(await r.read()) for r in responses]

# === benchmark (simulated with sleep) ===
N_IO = 8
DELAY = 0.1

print("\n=== I/O-bound benchmark ===")

start = time.perf_counter()
for _ in range(N_IO):
    io_task_simulated(DELAY)
t_serial = time.perf_counter() - start
print(f"serial: {t_serial:.4f}s")

start = time.perf_counter()
threads = [threading.Thread(target=io_task_simulated, args=(DELAY,)) for _ in range(N_IO)]
for t in threads: t.start()
for t in threads: t.join()
t_threading = time.perf_counter() - start
print(f"threading: {t_threading:.4f}s")

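async def run_simulated_io():
    # asyncio.run() expects a coroutine, and gather() needs a running loop,
    # so the gather call is wrapped in a coroutine function
    return await asyncio.gather(*(async_io_task_simulated(DELAY) for _ in range(N_IO)))

start = time.perf_counter()
asyncio.run(run_simulated_io())
t_asyncio = time.perf_counter() - start
print(f"asyncio: {t_asyncio:.4f}s")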

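# Typical results:
# serial:     ~0.8s  (8 x 0.1s)
# threading:  ~0.1s  (the waits overlap)
# asyncio:    ~0.1s  (the waits overlap)
#
# Conclusion: for I/O-bound work, either threading or asyncio does the job
# asyncio is lighter (single thread, so little need for locks)
# threading is simpler (no async/await required)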
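
The run_in_executor pattern in asyncio_io has a shorter spelling on Python 3.9 and later: asyncio.to_thread. The following is a minimal sketch, assuming a sleep-based stand-in for the blocking call; blocking_fetch and fetch_all are hypothetical names, not part of the benchmark above.

```python
import asyncio
import time


def blocking_fetch(delay):
    """Stand-in for a blocking I/O call (hypothetical helper)."""
    time.sleep(delay)
    return delay


async def fetch_all(delays):
    # asyncio.to_thread (Python 3.9+) runs each blocking call in the
    # default thread pool, so the event loop stays free while they wait.
    return await asyncio.gather(*(asyncio.to_thread(blocking_fetch, d) for d in delays))


if __name__ == "__main__":
    start = time.perf_counter()
    results = asyncio.run(fetch_all([0.1] * 8))
    print(f"8 blocking calls via to_thread: {time.perf_counter() - start:.4f}s")
```

gather returns the results in the order the awaitables were passed in, so the output lines up with the input list even though the calls finish at different times.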

11.3 concurrent.futures

import time
import concurrent.futures
import multiprocessing

# === concurrent.futures: one interface for threads and processes ===
# ThreadPoolExecutor: thread pool (I/O-bound work)
# ProcessPoolExecutor: process pool (CPU-bound work)

def cpu_work(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

def io_work(delay):
    time.sleep(delay)
    return delay

# === 1. ProcessPoolExecutor (CPU-bound) ===
def process_pool_demo():
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(cpu_work, 5_000_000) for _ in range(8)]
        results = [f.result() for f in concurrent.futures.as_completed(futures)]
    return results

# === 2. ThreadPoolExecutor (I/O-bound) ===
def thread_pool_demo():
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
        futures = [executor.submit(io_work, 0.1) for _ in range(8)]
        results = [f.result() for f in concurrent.futures.as_completed(futures)]
    return results

# === 3. Simplified version with map ===
def map_demo():
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
        # Like the built-in map, but the calls run in parallel
        results = list(executor.map(cpu_work, [5_000_000] * 8))
    return results

# === benchmark ===
print("=== concurrent.futures benchmark ===")

start = time.perf_counter()
process_pool_demo()
t_proc = time.perf_counter() - start
print(f"ProcessPoolExecutor (8 CPU tasks, 4 processes): {t_proc:.4f}s")

start = time.perf_counter()
thread_pool_demo()
t_thread = time.perf_counter() - start
print(f"ThreadPoolExecutor (8 I/O tasks, 8 threads): {t_thread:.4f}s")

start = time.perf_counter()
map_demo()
t_map = time.perf_counter() - start
print(f"ProcessPoolExecutor.map (8 CPU tasks, 4 processes): {t_map:.4f}s")
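
One difference between the two styles above is easy to miss: as_completed yields futures in the order they finish, while executor.map returns results in input order. The sketch below illustrates this with a thread pool and short sleeps; run_as_completed and run_map are hypothetical helper names.

```python
import concurrent.futures
import time


def io_work(delay):
    time.sleep(delay)
    return delay


def run_as_completed(delays):
    """Return (input, result) pairs in the order the tasks finish."""
    finished = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(delays)) as executor:
        # Map each future back to the argument it was submitted with.
        future_to_delay = {executor.submit(io_work, d): d for d in delays}
        for future in concurrent.futures.as_completed(future_to_delay):
            finished.append((future_to_delay[future], future.result()))
    return finished


def run_map(delays):
    """Return results in input order, whichever task finishes first."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(delays)) as executor:
        return list(executor.map(io_work, delays))


if __name__ == "__main__":
    delays = [0.3, 0.1, 0.2]
    print("as_completed:", run_as_completed(delays))
    print("map:         ", run_map(delays))
```

Keeping a dict from future to input is a common way to know which task a completed future belongs to; future.result() also re-raises any exception the task raised, so it is a natural place for error handling.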

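Decision tree for choosing a concurrency approach

Need concurrency?
├── CPU-bound (computation)
│   └── multiprocessing / ProcessPoolExecutor
│       - Sidesteps the GIL, true parallelism
│       - Inter-process communication has a cost (pickle serialization)
│       - Good for: data processing, numerical computation, machine learning
│
├── I/O-bound (network / files / database)
│   ├── Simple cases → threading / ThreadPoolExecutor
│   │   - Straightforward, no need to restructure the code
│   │   - Good for: a small number of concurrent I/O operations
│   │
│   ├── High concurrency → asyncio
│   │   - Single thread, so little need for locks
│   │   - Good for: web servers, API clients, WebSocket
│   │   - Requires: async/await through the whole call chain
│   │
│   └── Mixed → asyncio + run_in_executor
│       - CPU part goes to a thread pool / process pool
│       - I/O part stays in asyncio
│
├── Need shared state?
│   ├── A little → threading + Lock/Queue
│   └── A lot → multiprocessing + SharedMemory/Queue
│
└── Not sure?
    └── Start with concurrent.futures (one interface, easy to switch)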
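
For the "a little shared state → threading + Lock" branch, a minimal sketch might look like the following. The Counter class and count_in_threads helper are hypothetical names used only for illustration.

```python
import threading


class Counter:
    """A counter that is safe to increment from several threads."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        # `self._value += 1` is a read-modify-write sequence; the lock
        # keeps two threads from interleaving inside it.
        with self._lock:
            self._value += 1

    @property
    def value(self):
        with self._lock:
            return self._value


def count_in_threads(n_threads, n_increments):
    """Increment one shared counter from n_threads threads."""
    counter = Counter()

    def worker():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


if __name__ == "__main__":
    print(count_in_threads(4, 10_000))
```

The GIL does not make compound operations such as += atomic, so shared mutable state still needs a lock (or a queue.Queue, which does its own locking).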

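Key differences

Dimension	Java/Kotlin	Python
CPU parallelism	Threads (true parallelism)	multiprocessing (sidesteps the GIL)
I/O concurrency	Threads / coroutines	threading / asyncio
Thread overhead	Lighter than a process (OS threads)	Lighter than a process (OS threads)
Coroutines	Kotlin coroutines	asyncio
GIL	No	Yes (CPython)
Shared memory	Built in (threads share the heap)	Across processes, data must be serialized
Thread pool	ExecutorService	ThreadPoolExecutor
Process pool	No direct equivalent (ProcessBuilder only launches external processes)	ProcessPoolExecutor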

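Common pitfalls

# Pitfall 1: using threading for CPU-bound work
# Because of the GIL, only one thread executes Python bytecode at any moment
# threading does not help CPU-bound code and can even be slower (thread-switching overhead)

# Pitfall 2: running CPU-bound work inside asyncio
import asyncio

async def bad():
    result = sum(i * i for i in range(10_000_000))  # blocks the event loop!
    return result

# Better: hand the work to an executor with run_in_executor.
# None selects the default thread pool: the loop stays responsive, but the GIL
# still prevents a speedup. For real parallelism, pass a ProcessPoolExecutor
# and a module-level function instead of a lambda.
async def good():
    loop = asyncio.get_running_loop()
    result = await loop.run_in_executor(None, lambda: sum(i * i for i in range(10_000_000)))
    return result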

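# Pitfall 3: multiprocessing needs a main guard on Windows
# Windows creates child processes with spawn, which re-imports the main module,
# so code that starts processes must sit under `if __name__ == '__main__':`
# (macOS has also defaulted to spawn since Python 3.8)

# Pitfall 4: ignoring inter-process communication overhead
# multiprocessing pickles the data it passes between processes
# Passing 1 GB of data can take several seconds
# Fix: use shared memory (multiprocessing.shared_memory)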
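
As a rough illustration of the shared-memory fix in pitfall 4, the sketch below writes bytes into a named block and reads them back through a second handle. To stay short it runs in a single process; in real use the second handle would be opened inside a worker process that receives only the block's name. shared_memory_round_trip is a hypothetical helper name.

```python
from multiprocessing import shared_memory


def shared_memory_round_trip(payload: bytes) -> bytes:
    """Write payload into a shared block, then read it back via a second handle."""
    block = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        # Copy the bytes into the block; nothing is pickled here.
        block.buf[:len(payload)] = payload

        # A worker process would attach the same way, using only the name.
        view = shared_memory.SharedMemory(name=block.name)
        try:
            # The block may be larger than requested, so slice to the payload size.
            return bytes(view.buf[:len(payload)])
        finally:
            view.close()
    finally:
        block.close()
        block.unlink()  # release the block once every handle is closed


if __name__ == "__main__":
    print(shared_memory_round_trip(b"large array bytes would go here"))
```

Only the short block name crosses the process boundary, so the cost of sharing no longer grows with the size of the data. Every handle needs close(), and exactly one owner should call unlink().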

When to use what

CPU-bound: multiprocessing / ProcessPoolExecutor
I/O-bound (simple): threading / ThreadPoolExecutor
I/O-bound (high concurrency): asyncio
Mixed: asyncio + run_in_executor
Quick prototype: concurrent.futures (one interface)
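
The "one interface" point can be sketched as a helper that takes the executor class as a parameter, so switching from threads to processes is a one-argument change. run_all is a hypothetical name, and the example assumes a sleep-based stand-in for I/O.

```python
import concurrent.futures
import time


def io_work(delay):
    time.sleep(delay)
    return delay


def run_all(executor_cls, func, args, max_workers=4):
    """Run func over args with whichever executor class the caller picks."""
    with executor_cls(max_workers=max_workers) as executor:
        return list(executor.map(func, args))


if __name__ == "__main__":
    # I/O-bound: threads are enough.
    print(run_all(concurrent.futures.ThreadPoolExecutor, io_work, [0.1] * 4))
    # CPU-bound: pass concurrent.futures.ProcessPoolExecutor instead. The
    # function must then be a picklable module-level function, and the call
    # must stay under this __main__ guard.
```

Both executor classes accept max_workers and provide submit and map, which is what makes the swap cheap while prototyping.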

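Chapter summary: a performance optimization decision tree

Need to optimize performance?
│
├── Step one: profile first! Do not guess where the bottleneck is
│   ├── Micro-benchmarks → timeit
│   ├── Function-level analysis → cProfile
│   ├── Line-level analysis → line_profiler
│   ├── Production → py-spy
│   └── Memory analysis → tracemalloc / memray
│
├── Algorithm / data structure problem?
│   ├── O(n) → O(1): set instead of list for lookups
│   ├── O(n²) → O(n log n): sort + binary search
│   ├── O(n²) → O(n): join instead of + concatenation
│   └── deque instead of list.insert(0, x)
│
├── I/O bottleneck?
│   ├── Low concurrency → threading
│   ├── High concurrency → asyncio
│   └── Mixed → asyncio + run_in_executor
│
├── CPU bottleneck?
│   ├── Can a built-in do it? → sum(), max(), sorted() (implemented in C)
│   ├── Can it be cached? → @lru_cache / @cache
│   ├── Numerical computation? → numpy vectorization (50-500x)
│   ├── Loop-heavy? → Numba @jit (50-200x)
│   ├── Need compilation? → Cython (50-150x)
│   ├── Multiple cores available? → multiprocessing
│   └── Pure Python? → upgrade to 3.11+ (a free 10-60%, about 25% on average)
│
├── Memory bottleneck?
│   ├── Many small objects? → __slots__ (saves 40-60%)
│   ├── Large datasets? → generators (yield)
│   ├── Reference cycles? → weakref
│   ├── String concatenation? → join
│   └── Cache growing without bound? → lru_cache(maxsize=...)
│
├── Function call overhead?
│   ├── Avoid calling small functions in hot loops
│   ├── Hoist attribute lookups into local variables
│   ├── Cache global functions in local variables
│   └── Inspect the bytecode with the dis module
│
└── Still not fast enough?
    ├── PyPy (3-10x for pure Python)
    ├── Call C code (ctypes/cffi)
    ├── Rewrite in C/C++/Rust
    └── Switch languages (usually unnecessary)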
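
The "profile first" step can be as small as the sketch below, which runs one call under cProfile and returns a text report sorted by cumulative time. parse_line, parse_all, and profile_call are hypothetical names chosen for illustration.

```python
import cProfile
import io
import pstats


def parse_line(line):
    """Parse one comma-separated line of integers."""
    return [int(field) for field in line.split(",")]


def parse_all(lines):
    return [parse_line(line) for line in lines]


def profile_call(func, *args):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(10)  # top 10 by cumulative time
    return result, buffer.getvalue()


if __name__ == "__main__":
    lines = ["1,2,3,4,5"] * 50_000
    _, report = profile_call(parse_all, lines)
    print(report)
```

Sorting by cumulative time shows which call tree dominates the run. For a whole script, `python -m cProfile -s cumulative script.py` gives a similar report without changing any code.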

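Optimization priorities

1. Choose the right algorithm and data structure → potential: 100-10000x
2. Use built-ins and the standard library         → potential: 10-100x
3. numpy/pandas vectorization                     → potential: 50-500x
4. Numba/Cython compilation                       → potential: 50-200x
5. Upgrade to Python 3.11+                        → potential: 1.1-1.6x (about 1.25x on average)
6. multiprocessing                                → potential: up to N cores
7. Caching                                        → potential: depends on how often calls repeat
8. __slots__ memory optimization                  → potential: 40-60% less memory
9. asyncio/threading for I/O                      → potential: depends on the share of time spent waiting on I/O
10. PyPy as an alternative interpreter            → potential: 3-10x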
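
Priority 1 is easiest to appreciate with a measurement. The sketch below times the same membership test against a list and a set using timeit. The sizes are arbitrary assumptions, count_hits is a hypothetical helper, and the actual timings depend on the machine, so none are quoted here.

```python
import timeit


def count_hits(container, queries):
    """Count how many queries are present in container."""
    return sum(1 for q in queries if q in container)


if __name__ == "__main__":
    data = list(range(2_000))
    queries = list(range(0, 4_000, 4))
    as_list, as_set = data, set(data)

    # Same function, same queries; only the container type changes.
    t_list = timeit.timeit(lambda: count_hits(as_list, queries), number=5)
    t_set = timeit.timeit(lambda: count_hits(as_set, queries), number=5)
    print(f"list membership: {t_list:.4f}s")
    print(f"set membership:  {t_set:.4f}s")
```

The list version scans up to every element per query, while the set version does a hash lookup, so the gap widens as the data grows.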

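Summary in one sentence

The core principles of Python performance optimization: profile first, then optimize; fix the algorithm before reaching for tools; prefer built-ins before extensions. Understanding the CPython execution model (interpreted execution, the GIL, reference counting) is the foundation of every optimization. Most performance problems can be solved by choosing the right data structure and using the standard library well, without bringing in complex acceleration schemes.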