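Perf study notes

perf is a profiling tool on Linux. It shows how a program actually uses resources while it runs, which helps locate bottlenecks, improve the code, and raise the program's performance.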

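Basics

Option 1: perf stat prints the results directly.
Option 2: perf record + perf report. The former records the data; the latter displays it.

The following uses the output of sudo perf stat ping 127.0.0.1 as an example.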

PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.023 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.058 ms
64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.062 ms
^C
--- 127.0.0.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2034ms
rtt min/avg/max/mdev = 0.023/0.047/0.062/0.019 ms

 Performance counter stats for 'ping 127.0.0.1':

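              1.31 msec task-clock                #    0.000 CPUs utilized # time the task actually ran on a CPU
                 5      context-switches          #    0.004 M/sec # number of context switches
                 0      cpu-migrations            #    0.000 K/sec # number of migrations between CPUs
                96      page-faults               #    0.073 M/sec # number of page faults
         2,814,370      cycles                    #    2.148 GHz # effective clock rate: cycles per second of task-clock
         1,799,925      instructions              #    0.64  insn per cycle # IPC: instructions executed per clock cycle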
           357,091      branches                  #  272.573 M/sec  
            16,432      branch-misses             #    4.60% of all branches

       2.994683213 seconds time elapsed   # total elapsed time of the program

       0.001777000 seconds user          # time spent in user space
       0.000000000 seconds sys            # time spent in kernel space

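Clock frequency (CPU clock rate): the number of clock cycles per second, that is, how often the digital clock signal oscillates.
IPC: the number of instructions executed per clock cycle.
CPU time: not the elapsed run time of the program, but the time the CPU actually spent executing it (task-clock in the output above).
Instructions executed: CPU time × clock frequency × IPC.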
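
As a rough check of this relationship, the raw counters from the sample run can be plugged back in. This is only a sketch: derive_metrics is a hypothetical helper name, not part of perf, and it assumes awk is available.

```shell
# Hypothetical helper (not a perf subcommand): recompute the derived
# columns of perf stat from the raw counters.
derive_metrics() {
  # $1 = task-clock in msec, $2 = cycles, $3 = instructions
  awk -v ms="$1" -v cyc="$2" -v ins="$3" 'BEGIN {
    # cycles per nanosecond of CPU time is the clock rate in GHz
    printf "GHz=%.3f IPC=%.2f\n", cyc / (ms * 1e6), ins / cyc
  }'
}

# Counters taken from the ping example above
derive_metrics 1.31 2814370 1799925   # prints: GHz=2.148 IPC=0.64
```

The result agrees with the 2.148 GHz and 0.64 insn per cycle that perf printed, and 1.31 msec × 2.148 GHz × 0.64 comes out at roughly 1.8 million, close to the 1,799,925 instructions counted.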

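Common commands

sudo perf list: lists all supported event types.
sudo perf stat -r 5 ls: runs the command 5 times and reports the average and deviation of each counter.
perf stat -e cycles dd if=/dev/zero of=/dev/null count=100000: counts only the cycles event; -e selects the events to keep.
perf stat -e cycles:u dd if=/dev/zero of=/dev/null count=100000: counts cycles in user space only (the :u modifier).
perf stat -a ls: collects system-wide, across all CPUs.
perf record ./a.out: records samples into perf.data.
perf report: reads perf.data from the current directory.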

sandsoftwaresound.net/perf/perf-t…

Generating a flame graph

perf record -F 99 -a -g -- sleep 60
perf script > out.perf
git clone --depth 1 https://github.com/brendangregg/FlameGraph.git

# Fold the call stacks
FlameGraph/stackcollapse-perf.pl out.perf > out.folded

# Render the flame graph
FlameGraph/flamegraph.pl out.folded > out.svg

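The flame graph is displayed correctly only if the program was compiled with -g. See gaomf.cn/2019/10/30/…

Within one level, sibling functions are ordered alphabetically by name, not by call order.

A wider frame does not mean more elapsed time; it means the function appeared in more samples, that is, it used more CPU. Wide frames are the ones to optimize first.

Time spent in sleep is not counted, because a sleeping task is off the CPU and is not sampled.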
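
To make the point about width concrete, the share of samples in which a function appears can be computed from out.folded directly. The sketch below assumes the folded format written by stackcollapse-perf.pl (semicolon-separated frames followed by a sample count on each line); frame_share is a hypothetical helper, not part of FlameGraph.

```shell
# Hypothetical helper: percentage of all samples whose stack contains
# the given function. This corresponds to the frame's width in the graph.
# Usage: frame_share main out.folded
frame_share() {
  # $1 = function name, $2 = folded stack file
  awk -v fn="$1" '{
    count = $NF                     # last field is the sample count
    stack = $0
    sub(/ [0-9]+$/, "", stack)      # drop the trailing count
    total += count
    n = split(stack, frames, ";")
    for (i = 1; i <= n; i++)
      if (frames[i] == fn) { hit += count; break }
  } END { if (total > 0) printf "%.1f%%\n", 100 * hit / total }' "$2"
}
```

A function that sleeps most of the time contributes few samples and therefore a small share, even if the program spends most of its elapsed time inside it.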

TEST (understanding sleep)

Running perf stat ./a.out shows about 1000 context switches. This is because usleep voluntarily gives up the CPU.

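#include <iostream>
#include <unistd.h>

int main()
{
  std::cout << "Hello ";
  // Assumption: 1000 iterations, chosen to match the roughly 1000
  // context switches reported for this test.
  for (int i = 0; i < 1000; ++i)
  {
    usleep(10000); // block for 10 ms; the CPU is handed to another task
  }
  std::cout << "World";
  std::cout << std::endl;
  return 0;
}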

The program below simulates a compute-bound workload. It shows very few context switches, and those occur mainly because its time slice expired.

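#include <iostream>
#include <unistd.h>

int main()
{
  std::cout << "Hello ";
  // Bounded busy loop that never blocks. volatile keeps the compiler
  // from removing it; the iteration count is arbitrary.
  volatile unsigned long long sum = 0;
  for (unsigned long long i = 0; i < 1000000000ULL; ++i)
  {
    sum = sum + i;
  }
  // usleep(1);
  std::cout << "World";
  std::cout << std::endl;

  return 0;
}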

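Conclusion: a context switch is triggered both when a task voluntarily gives up the CPU (and enters the blocked state) and when it is preempted because its time slice expired (and returns to the ready state).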
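
The two kinds of switch can also be observed without perf. As a sketch, assuming a Linux system whose /proc/<pid>/status exposes the voluntary_ctxt_switches and nonvoluntary_ctxt_switches counters, the hypothetical helper ctxt_switches below prints both for a given PID.

```shell
# Hypothetical helper: print "<voluntary> <nonvoluntary>" context switch
# counts for a PID, read from /proc/<pid>/status (Linux only).
ctxt_switches() {
  awk '/^voluntary_ctxt_switches/    { v = $2 }
       /^nonvoluntary_ctxt_switches/ { n = $2 }
       END { print v + 0, n + 0 }' "/proc/$1/status"
}

# Sample the current shell before and after it blocks waiting for sleep.
ctxt_switches $$
sleep 1
ctxt_switches $$
```

The first number (voluntary) should grow between the two samples, since the shell blocks while it waits for sleep to finish. The second number (nonvoluntary) grows only when the shell is preempted.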

TEST probe

perf probe -x ./a.out --funcs # list the functions that can be probed
perf probe -x ./a.out --add test # add a probe on the function test
perf record -e probe_a:test -aR sleep 1 # record hits on the probe for 1 second