Enabling cpuprofile actually improved performance

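Recently I tested how different profiling tools, such as perf, go pprof, and fgprof, affect the application being profiled.

I found something strange: with cpuprofile enabled, performance actually went up.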

rpc benchmark

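The test uses github.com/bytedance/kitex-benchmark.

The server gets 4 cores and the client 12 cores. A total of 20 million 1024-byte messages are sent over unary gRPC.

The usual benchmarking precautions are all in place:

warm-up: kitex-benchmark has a built-in warm-up phase.
nice: priority adjusted with nice -20.
numactl: the processes run only on the specified cores.
Turbo boost is disabled.
Go 1.18.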

Results

Without the cpu profiler: 85000 tps.
With the cpu profiler enabled: 95000 tps.
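
For reference, here is a minimal sketch of what "enabling the cpu profiler" usually means in a Go program. The post does not say how kitex-benchmark turns it on, so this only assumes the standard runtime/pprof API; busyWork and profileCPU are hypothetical names used for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// busyWork is a hypothetical stand-in for the workload being measured.
func busyWork(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i % 7
	}
	return sum
}

// profileCPU runs work with the CPU profiler enabled and returns the size
// in bytes of the collected profile.
func profileCPU(work func()) (int, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return 0, err // e.g. a CPU profile is already running
	}
	work()
	pprof.StopCPUProfile() // returns once the profile is fully written to buf
	return buf.Len(), nil
}

func main() {
	size, err := profileCPU(func() { busyWork(50_000_000) })
	if err != nil {
		fmt.Println("could not start CPU profile:", err)
		return
	}
	fmt.Println("profile size in bytes:", size)
}
```

In a real benchmark the profile would normally go to a file rather than an in-memory buffer. The buffer just keeps the sketch self-contained.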

You surely think this is impossible!

Then take a look at the block profiler results below.

chan benchmark

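The test uses the BenchmarkChanProdCons functions in runtime/chan_test.

They start procs producers and procs consumers and send 100 million messages through a chan. The parameters are:

chan size: 0, 10, or 100.
work: whether each goroutine runs some extra test code on every loop iteration.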
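
To make the shape of this benchmark concrete, here is a simplified standalone sketch of the producer/consumer pattern described above. It is not the runtime's benchmark code: prodCons, spin, and the perProducer parameter are hypothetical names, and the real benchmark is driven by testing.B rather than a fixed message count.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// spin does localWork iterations of throwaway arithmetic, standing in for
// the per-iteration work of the "Work" benchmark variants.
func spin(localWork int) int {
	foo := 0
	for i := 0; i < localWork; i++ {
		foo *= 2
		foo /= 2
	}
	return foo
}

// prodCons starts procs producers and procs consumers sharing one chan of
// the given size, and returns the total number of messages received.
func prodCons(procs, chanSize, localWork, perProducer int) int64 {
	ch := make(chan int, chanSize)
	var received int64
	var producers, consumers sync.WaitGroup

	for p := 0; p < procs; p++ {
		producers.Add(1)
		go func() {
			defer producers.Done()
			for i := 0; i < perProducer; i++ {
				spin(localWork)
				ch <- 1
			}
		}()

		consumers.Add(1)
		go func() {
			defer consumers.Done()
			var n int64
			for range ch { // ends when ch is closed and drained
				spin(localWork)
				n++
			}
			atomic.AddInt64(&received, n)
		}()
	}

	producers.Wait()
	close(ch) // no more sends; lets the consumers exit
	consumers.Wait()
	return received
}

func main() {
	for _, size := range []int{0, 10, 100} {
		fmt.Println("chan size", size, "received", prodCons(4, size, 100, 10000))
	}
}
```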

block

With a chan size of 0, enabling the block profiler even improved performance!

benchmark                          old ns/op     new ns/op     delta
BenchmarkChanProdCons0-8           844           804           -4.80%
BenchmarkChanProdCons10-8          666           765           +14.72%
BenchmarkChanProdCons100-8         416           510           +22.52%
BenchmarkChanProdConsWork0-8       845           811           -4.05%
BenchmarkChanProdConsWork10-8      828           864           +4.25%
BenchmarkChanProdConsWork100-8     643           659           +2.50%
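
The post does not show how the block profiler was switched on. With go test it is typically the -blockprofile flag. In plain code, a minimal sketch (assuming runtime.SetBlockProfileRate with a rate of 1, which may differ from what was used for the numbers above) looks like this. blockedStacks is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
)

// blockedStacks enables the block profiler, exchanges n values over an
// unbuffered chan (so sender or receiver has to block), and returns the
// number of distinct blocking call stacks in the block profile.
func blockedStacks(n int) int {
	runtime.SetBlockProfileRate(1)       // assumption: try to record every blocking event
	defer runtime.SetBlockProfileRate(0) // turn the profiler off again

	ch := make(chan int)
	done := make(chan struct{})
	go func() {
		defer close(done)
		for i := 0; i < n; i++ {
			<-ch
		}
	}()
	for i := 0; i < n; i++ {
		ch <- i
	}
	<-done

	return pprof.Lookup("block").Count()
}

func main() {
	fmt.Println("blocking stacks recorded:", blockedStacks(1000))
}
```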

Comparing the cpu profiles, we can see:

The saveblockevent function consumes more cpu.
The lock and unlock functions consume less cpu, which suggests that contention on the chan has decreased.

[Screenshot: cpu profile comparison with the block profiler off and on]

Why the block profiler improves performance here is still a mystery.

The lesson is: never go in with preconceptions.

trace

The performance overhead of trace shrinks as the chan size grows.

benchmark                          old ns/op     new ns/op     delta
BenchmarkChanProdCons0-8           844           2986          +253.67%
BenchmarkChanProdCons10-8          666           1508          +126.26%
BenchmarkChanProdCons100-8         416           486           +16.71%
BenchmarkChanProdConsWork0-8       845           2982          +252.90%
BenchmarkChanProdConsWork10-8      828           1478          +78.42%
BenchmarkChanProdConsWork100-8     643           706           +9.70%
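
As a rough way to poke at this yourself, here is a minimal sketch that runs a chan exchange under the execution tracer using the standard runtime/trace package. traceSize is a hypothetical helper. The size of the trace is only a crude proxy for how many events were recorded, so treat the printed numbers as an illustration, not a measurement.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/trace"
)

// traceSize exchanges n messages over a chan of the given size while the
// execution tracer is running, and returns the trace size in bytes.
func traceSize(chanSize, n int) (int, error) {
	var buf bytes.Buffer
	if err := trace.Start(&buf); err != nil {
		return 0, err // e.g. tracing is already enabled
	}

	ch := make(chan int, chanSize)
	done := make(chan struct{})
	go func() {
		defer close(done)
		for range ch { // drain until ch is closed
		}
	}()
	for i := 0; i < n; i++ {
		ch <- i
	}
	close(ch)
	<-done

	trace.Stop() // returns after all trace data has been written to buf
	return buf.Len(), nil
}

func main() {
	for _, size := range []int{0, 10, 100} {
		n, err := traceSize(size, 100000)
		if err != nil {
			fmt.Println("could not start trace:", err)
			return
		}
		fmt.Println("chan size", size, "trace bytes", n)
	}
}
```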

Let's walk through the chan send flow:

If the chan has a waiter, gounpark the waiter.
If there is no waiter and the chan still has space, put the value into the chan.
Otherwise, gopark.
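
The three steps above can be sketched as a toy, single-threaded model. This is not the runtime's chansend code: toyChan and the path names are hypothetical, and locking, the actual value copy, and the scheduler are all left out. It only shows which branch a send takes for a given chan state.

```go
package main

import "fmt"

// sendPath names the branch a send takes in this toy model.
type sendPath string

const (
	pathHandoff sendPath = "handed to a waiting receiver (gounpark)"
	pathBuffer  sendPath = "stored in the chan buffer"
	pathPark    sendPath = "sender blocks (gopark)"
)

// toyChan is a hypothetical model of the chan state relevant to a send.
type toyChan struct {
	waitingReceivers int // receivers currently parked on the chan
	buffered         int // values currently sitting in the buffer
	capacity         int // chan size
}

// send applies the three-step decision and reports which path was taken.
func (c *toyChan) send() sendPath {
	switch {
	case c.waitingReceivers > 0:
		c.waitingReceivers-- // step 1: wake the waiter and hand it the value
		return pathHandoff
	case c.buffered < c.capacity:
		c.buffered++ // step 2: room in the buffer, no scheduler involvement
		return pathBuffer
	default:
		return pathPark // step 3: nothing else possible, the sender parks
	}
}

func main() {
	c := &toyChan{capacity: 2}
	for i := 0; i < 3; i++ {
		fmt.Println(c.send())
	}
	c.waitingReceivers = 1
	fmt.Println(c.send())
}
```

Only the first and third paths involve parking or waking a goroutine. The second path touches the buffer alone.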

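trace records the gopark and gounpark events, and this shows up in the cpu profile comparison.

The larger the chan, the more often a send takes the second path, so the cpu cost caused by trace is lower.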

[Screenshot: cpu profile comparison with trace off and on]

Why is rpc faster with cpuprofile enabled?

I can only guess. The cpu profiler causes goroutines to receive signals.

That may reduce the performance overhead spent on mutexes.