VTune

Including some basics of microarchitecture.

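Different sampling methods come with different overhead; user-mode sampling (the default) has relatively high overhead.
The flame graph is the aggregation of the call stacks of all threads at the sampling points.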
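As a rough illustration of what "aggregation of call stacks" means, the sketch below merges identical stacks from every thread and counts how often each one was sampled. The sample data and the function name `fold_stacks` are made up for the example; this is not how VTune stores its data internally.

```python
from collections import Counter

# Hypothetical samples: (thread id, call stack from outermost to innermost frame).
samples = [
    (1, ("main", "parse", "read_token")),
    (1, ("main", "parse", "read_token")),
    (2, ("main", "compress", "hash")),
    (1, ("main", "parse")),
    (2, ("main", "compress", "hash")),
    (2, ("main", "compress")),
]

def fold_stacks(samples):
    """Merge identical call stacks from all threads and count the samples in each."""
    return Counter(stack for _thread_id, stack in samples)

folded = fold_stacks(samples)
for stack, count in folded.most_common():
    # One line per distinct stack; the count determines the width of that stack in the graph.
    print(";".join(stack), count)
```

The thread id is dropped on purpose: the graph is built over all threads, so the same stack seen on two threads ends up in the same bar.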

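The flame graph shows CPU time, not elapsed time (with several threads running in parallel, CPU time can exceed elapsed time).
It covers all threads, not a single thread.
Sleep/wait does not affect the flame graph.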
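The difference between CPU time and elapsed time can be seen without a profiler. The sketch below uses Python's `time.process_time` (CPU time of the process) and `time.perf_counter` (elapsed time); the helper names `measure`, `sleeper` and `spinner` are made up for the example. It is single-threaded, so it only shows that sleeping adds elapsed time but almost no CPU time, not the multi-threaded case where CPU time exceeds elapsed time.

```python
import time

def measure(work):
    """Return (elapsed_seconds, cpu_seconds) for one call of work()."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu

def sleeper():
    time.sleep(0.2)  # waiting: elapsed time grows, CPU time barely moves

def spinner():
    end = time.perf_counter() + 0.2
    while time.perf_counter() < end:  # busy loop: burns CPU for about 0.2 s
        pass

sleep_wall, sleep_cpu = measure(sleeper)
spin_wall, spin_cpu = measure(spinner)
print(f"sleep: elapsed={sleep_wall:.3f}s cpu={sleep_cpu:.3f}s")
print(f"spin:  elapsed={spin_wall:.3f}s cpu={spin_cpu:.3f}s")
```

A sampling profiler that reports CPU time would see plenty of samples in `spinner` and essentially none in `sleeper`, even though both take about the same elapsed time.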

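Retired instructions: the number of instructions that actually completed execution, excluding instructions that were executed speculatively during out-of-order execution on a mispredicted branch path. www.intel.com/content/www…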
CPU utilization histogram

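The horizontal axis is the number n of CPUs in use at the same time, and the height of each bar is the time spent at that level. The further right, the higher the utilization; the maximum n is the number of logical cores.
Total CPU time is not the plain sum of the bars; it is the sum of x*y over all bars.
Sleep or wait time is counted in the 0 (idle) bar; since x is 0 there, it contributes nothing to CPU time.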
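The x*y rule can be checked with a small calculation. The histogram values below are invented for illustration; `total_cpu_time` is a hypothetical helper, not a VTune API.

```python
# Hypothetical histogram: number of CPUs in use at once -> elapsed seconds at that level.
histogram = {0: 3.0, 1: 4.0, 2: 2.0, 4: 1.0}

def total_cpu_time(histogram):
    """Sum x * y over all bars: each second with n busy CPUs is n CPU-seconds."""
    return sum(n_cpus * seconds for n_cpus, seconds in histogram.items())

elapsed = sum(histogram.values())    # plain sum of the bars = elapsed time
cpu_time = total_cpu_time(histogram) # weighted sum = CPU time
print(f"elapsed={elapsed}s cpu_time={cpu_time}s")
```

With these numbers the bars add up to 10 s of elapsed time, while the CPU time is 0*3 + 1*4 + 2*2 + 4*1 = 12 CPU-seconds; the 3 s in the idle bar add nothing.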

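Hotspot functions

The statistics are for the topmost functions (those closest to the sampling point), i.e. each function's own time. These times add up to the total time. They can be found in the bottom-up view.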
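A minimal sketch of this attribution, assuming a fixed sampling interval and made-up stacks: each sample is charged only to the innermost frame, so the per-function times add up to the total. The names `self_time_ms` and `SAMPLE_INTERVAL_MS` are hypothetical.

```python
from collections import Counter

SAMPLE_INTERVAL_MS = 10  # assumed sampling interval

# Hypothetical sampled stacks, outermost frame first, innermost (running) frame last.
stacks = [
    ("main", "parse", "read_token"),
    ("main", "parse", "read_token"),
    ("main", "parse"),
    ("main", "compress", "hash"),
    ("main", "compress", "hash"),
    ("main", "compress", "hash"),
]

def self_time_ms(stacks, interval_ms=SAMPLE_INTERVAL_MS):
    """Attribute each sample to the innermost frame only (self time)."""
    counts = Counter(stack[-1] for stack in stacks)
    return {func: n * interval_ms for func, n in counts.items()}

hotspots = self_time_ms(stacks)
total_ms = len(stacks) * SAMPLE_INTERVAL_MS
for func, ms in sorted(hotspots.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{func}: {ms} ms")
print(f"total: {total_ms} ms")
```

Here `main` and `compress` never appear in the result because they were never the innermost frame when a sample was taken; their callees carry all the time.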

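Pipeline

Conventional instruction execution goes through the following steps:
Instruction Fetch (IF)
Instruction Decode (ID)
Execute (EX)
Memory access (MEM)
Write-Back (WB)

If each step takes one cycle, executing one instruction takes five cycles. This is wasteful: at any given time only one unit is working while the others sit idle, so utilization is very low. With pipelining, as soon as the first instruction moves on to the second step (ID), the next instruction starts its IF. In the ideal case this reaches one instruction completed per cycle.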
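The gain from pipelining can be put into numbers. The sketch below assumes an ideal pipeline with one cycle per stage and no stalls or hazards; the function names are made up for the example.

```python
def cycles_unpipelined(n_instructions, n_stages=5):
    """Each instruction occupies the whole datapath for n_stages cycles."""
    return n_instructions * n_stages

def cycles_pipelined(n_instructions, n_stages=5):
    """Ideal pipeline: fill it once, then one instruction completes per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)

for n in (1, 5, 1000):
    print(n, cycles_unpipelined(n), cycles_pipelined(n))
```

For 1000 instructions this gives 5000 cycles without pipelining and 1004 with it, so the average cost per instruction approaches one cycle as the instruction count grows.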

Current CPUs do not have just 5 stages; their pipelines are deeper.

Intel CPUs are 4-issue: they can issue 4 instructions per cycle, much like having four pipelines, so the ideal CPI is 0.25.

IPC vs CPI: IPC is instructions per cycle and CPI is cycles per instruction; each is the reciprocal of the other.
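A small sketch of the two metrics, using invented counter values and an assumed issue width of 4; the function names `ipc` and `cpi` and the constant `ISSUE_WIDTH` are only for illustration.

```python
ISSUE_WIDTH = 4  # assumed: up to 4 instructions issued per cycle

def ipc(instructions_retired, cycles):
    """Instructions per cycle."""
    return instructions_retired / cycles

def cpi(instructions_retired, cycles):
    """Cycles per instruction, the reciprocal of IPC."""
    return cycles / instructions_retired

# Hypothetical counter readings.
instructions, cycles = 8_000_000, 4_000_000

measured_ipc = ipc(instructions, cycles)
measured_cpi = cpi(instructions, cycles)
ideal_cpi = 1 / ISSUE_WIDTH

print(f"IPC={measured_ipc} CPI={measured_cpi} ideal CPI={ideal_cpi}")
print(f"fraction of peak issue rate: {measured_ipc / ISSUE_WIDTH}")
```

With these numbers the IPC is 2.0 and the CPI is 0.5, which is half of what a 4-issue core could reach in the ideal case (CPI 0.25).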
micro-operations (uops) erik-engheim.medium.com/what-the-he…

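uops are an internal implementation detail of the CPU, whereas the CISC instruction set is more like the CPU's external interface.
The decoder breaks CISC instructions into fixed-length uops (simple operations, many of which execute in a single cycle), which makes later reordering and parallel execution easier.
uops are not the same thing as RISC instructions.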

TLB www.gamedev.net/tutorials/p… en.wikipedia.org/wiki/Transl…

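TLB stands for translation lookaside buffer. It is a small cache of recent virtual-to-physical address translations, typically part of the MMU, which lets a recently used virtual address be mapped to its physical address quickly. Like other caches, each entry pairs a tag with data (here the virtual page number and the physical frame number); comparing tags tells whether the wanted entry is currently cached.

Besides sitting in the MMU, a TLB may reside between the CPU and the cache, between different cache levels, or between the cache and main memory.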
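To make the lookup concrete, here is a toy model of a TLB. It is a sketch under several assumptions (4 KiB pages, a tiny fully associative TLB with LRU replacement, a dictionary standing in for the page table); real hardware differs in size, associativity and replacement policy. The class name `TinyTLB` and the mappings are made up.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed 4 KiB pages

class TinyTLB:
    """Toy TLB: caches virtual page number -> physical frame number, LRU eviction."""

    def __init__(self, page_table, capacity=2):
        self.page_table = page_table   # stands in for the in-memory page table
        self.capacity = capacity       # assumed to be at least 1
        self.entries = OrderedDict()   # vpn -> pfn, least recently used first
        self.hits = 0
        self.misses = 0

    def translate(self, virtual_address):
        vpn, offset = divmod(virtual_address, PAGE_SIZE)
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)             # mark as most recently used
        else:
            self.misses += 1
            self.entries[vpn] = self.page_table[vpn]  # slow path: page-table lookup
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)      # evict least recently used
        return self.entries[vpn] * PAGE_SIZE + offset

page_table = {0: 7, 1: 3, 2: 9}  # hypothetical vpn -> pfn mapping
tlb = TinyTLB(page_table)
for address in (0x0010, 0x0020, 0x1004, 0x2008, 0x0030):
    print(hex(address), "->", hex(tlb.translate(address)))
print("hits:", tlb.hits, "misses:", tlb.misses)
```

Only the second access hits, because it touches the same page as the first. The last access misses again: page 0 was evicted once pages 1 and 2 filled the two-entry TLB.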

micro arch

Instead of just processing a single instruction at the instruction pointer, the Pentium Pro processor could decode up to three instructions per cycle. Today's (circa 2008-2013) processors can decode up to four instructions at once. Decoding produces small fragments of operations called micro-ops or u-ops.

drivers

To run the portable version of NDM directly, the sampling driver has to be installed first.

www.intel.com/content/www…

setenv

Setting Path is necessary, but if you can open the GUI directly, I think you can skip it.
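One quick way to see whether Path is already set up is to ask whether the command-line tool can be found. The sketch below assumes the executable is called `vtune` (the name used by recent VTune Profiler releases; older releases used a different name, so adjust as needed). `find_on_path` is a hypothetical helper built on the standard `shutil.which`.

```python
import shutil

def find_on_path(tool="vtune"):
    """Return the full path of `tool` if a directory on PATH contains it, else None."""
    return shutil.which(tool)

location = find_on_path()
if location is None:
    print("vtune not found on PATH; set Path as described in the linked guide")
else:
    print("vtune found at", location)
```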

www.intel.com/content/www…
