View the currently active goroutines
- Count (a periodic-logging sketch follows at the end of this section)
runtime.NumGoroutine()
- Details
Import the pprof package in the program:
import _ "net/http/pprof"
Start an HTTP listener in the program:
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof"
)

func main() {
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	select {}
}
Then visit http://127.0.0.1:6060/debug/pprof/goroutine?debug=1
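For the count side, the runtime.NumGoroutine() call listed above can simply be logged periodically to watch for goroutine leaks. A minimal sketch (the 10-second interval is an arbitrary choice):

package main

import (
	"log"
	"runtime"
	"time"
)

func main() {
	// Log the number of live goroutines every 10 seconds (interval is arbitrary).
	go func() {
		for range time.Tick(10 * time.Second) {
			log.Printf("goroutines: %d", runtime.NumGoroutine())
		}
	}()
	select {}
}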
pprof
zhuanlan.zhihu.com/p/506035291
pprof originated from Google Perf Tools and has been integrated since early versions of Go. pprof consists of two parts:
- runtime/pprof: automatically compiled into every Go program
- go tool pprof: used to investigate and analyze profile files
Profile types
- cpu
- goroutine - stack traces of all current goroutines
- threadcreate - stack traces that led to the creation of new OS threads
- heap - a sampling of memory allocations of live objects
- allocs - a sampling of all past memory allocations
- block - stack traces that led to blocking on synchronization primitives
- mutex - stack traces of holders of contended mutexes
By default, block and mutex information is not tracked. To see these two profiles, add two lines to your code:
runtime.SetBlockProfileRate(1)     // enable tracking of blocking operations (block profile)
runtime.SetMutexProfileFraction(1) // enable tracking of mutex contention (mutex profile)
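A minimal sketch putting the two calls together with the HTTP endpoint from earlier. Note that a rate of 1 records every event, which is the most expensive setting; production values are usually higher:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof"
	"runtime"
)

func main() {
	runtime.SetBlockProfileRate(1)     // record every blocking event (expensive; fine for debugging)
	runtime.SetMutexProfileFraction(1) // report every mutex contention event
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	// ... application code ...
	select {}
}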
CPU profiling
When CPU profiling is enabled, the runtime interrupts itself at the profiling rate (100 times per second, i.e. every 10 ms, by default) and records the stack traces of the goroutines that are currently running.
Once the profile is complete, we can use it to analyze the hottest code paths.
The more often a function appears in the profile, the larger the share of total running time it accounts for.
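Besides the HTTP endpoint, a CPU profile can also be written to a file directly with the runtime/pprof package. A minimal sketch (the file name cpu.prof is arbitrary):

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("cpu.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Start sampling; the runtime interrupts the program ~100 times per second.
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	// ... the workload to profile ...
}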
Implementation
// setcpuprofilerate sets the CPU profiling rate to hz times per second.
// If hz <= 0, setcpuprofilerate turns off CPU profiling.
func setcpuprofilerate(hz int32) {
	// ...
}

func setProcessCPUProfiler(hz int32) {
	setProcessCPUProfilerTimer(hz)
}
At each interval (the rate), a SIGPROF signal is sent to the thread that the current g is running on.
Because of the GMP scheduler design, in the vast majority of cases this timer + signal + current-thread approach, combined with the preemptive scheduling now supported, samples the runtime's CPU profile well. Some extreme cases may still not be covered, since the sampling is only based on the current M.
Memory Profiling
Memory profiling records the stack trace at the time of a heap allocation.
Like CPU profiling, memory profiling is sample based. The default MemProfileRate is 512 KB and can be configured by the user.
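A minimal sketch of tuning the sampling rate and dumping a heap profile with runtime/pprof (the 64 KB rate and the file name heap.prof are arbitrary choices):

package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

func init() {
	// Sample roughly one allocation per 64 KB instead of the default 512 KB.
	// Must be set as early as possible, before the allocations of interest happen.
	runtime.MemProfileRate = 64 * 1024
}

func main() {
	// ... allocate ...

	f, err := os.Create("heap.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	runtime.GC() // force a GC so the profile reflects up-to-date live-object statistics
	if err := pprof.WriteHeapProfile(f); err != nil {
		log.Fatal(err)
	}
}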
Block Profiling
Block profiling records the time a goroutine spends waiting on a shared resource. It can reveal whether a large number of goroutines could be doing work but are blocked instead (on channels, select, or mutexes).
Block events are recorded in chan send/recv, select, semacquire1 and notifyListWait.
Mutex Profiling
Mutex profiling is similar to block profiling, but it focuses on the latency introduced by exclusive operations caused by mutex contention.
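A minimal sketch of contention that shows up in both profiles: several goroutines fighting over one sync.Mutex, then dumping the named profiles via runtime/pprof.Lookup (debug=1 selects the human-readable text form):

package main

import (
	"os"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

func main() {
	runtime.SetBlockProfileRate(1)
	runtime.SetMutexProfileFraction(1)

	var mu sync.Mutex
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100; j++ {
				mu.Lock()
				time.Sleep(time.Millisecond) // hold the lock so other goroutines contend
				mu.Unlock()
			}
		}()
	}
	wg.Wait()

	// Dump the human-readable (debug=1) form of both profiles to stdout.
	pprof.Lookup("block").WriteTo(os.Stdout, 1)
	pprof.Lookup("mutex").WriteTo(os.Stdout, 1)
}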
Profiling a Go program
Only run one kind of profile at a time (one profile at a time).
go test
Enabling any one of the flags below produces the corresponding profile file. Be careful when enabling several flags at once, because one profiling activity may affect the results of the others.
go test -cpuprofile cpu.prof
go test -memprofile mem.prof
go test -blockprofile block.prof
go test -mutexprofile mutex.prof
Once we have collected sample data, we can analyze it with pprof. It ships with the Go toolchain but is not an everyday tool; it corresponds to the go tool pprof command. The command has many features and options, but the two most basic arguments are: the executable that produced the profile, and the corresponding profile data.
func TestBasic(t *testing.T) {
	for i := 0; i < 200; i++ { // running it many times
		if Fib2(30) != 832040 {
			t.Error("Incorrect!")
		}
	}
}
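The source does not show Fib2; a plausible stand-in that satisfies the test (Fib(30) = 832040) and is slow enough to produce useful CPU samples is the naive recursive Fibonacci:

// Fib2 is assumed here to be the naive exponential-time recursive Fibonacci.
func Fib2(n int) int {
	if n < 2 {
		return n
	}
	return Fib2(n-1) + Fib2(n-2)
}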
In the directory containing the test file, run:
go test -run TestBasic -cpuprofile cpu.prof
go tool pprof cpu.prof
Then, at the pprof prompt, enter:
top5 -cum
The topN command shows the top N entries, and the -cum flag sorts by cumulative time.
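A few other interactive commands that are commonly useful (Fib2 here is just the example function above; the web command requires Graphviz to be installed):
(pprof) top5 -cum    # top 5 entries sorted by cumulative time
(pprof) list Fib2    # annotated source for functions matching the regexp
(pprof) web          # render the call graph in a browser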
HTTP endpoint access
There is also a standard HTTP interface to profiling data. Adding the following line will install handlers under the /debug/pprof/ URL to download live profiles:
import _ "net/http/pprof"
Visit http://host:port/debug/pprof
You can also run pprof against the endpoint directly: go tool pprof http://host:port/debug/pprof/heap
go tool pprof -png http://host:port/debug/pprof/heap > heap.png // generate a PNG file
Exporting to a file first is recommended: although it involves more steps, the file is persisted on disk, so important data is not lost.
// export to a file
curl -o heap.out http://IP:Port/debug/pprof/heap
// parse the file and enter the interactive command line
go tool pprof heap.out
// from here on it is exactly the same as analyzing directly from the command line
// top 10
// list funcName
Example: locating a memory problem
curl -o heap.out http://IP:Port/debug/pprof/heap
go tool pprof heap.out
top 10 // show the top 10 functions by heap memory usage
- flat: memory allocated by this function and still held by it.
- cum: total memory allocated by this function plus the functions below it in the call stack.
list main.testReadAll // show the details of the function
This finally pointed to ioutil.ReadAll as the method using too much memory.
As most readers know, ioutil.ReadAll reads a file or stream into memory in one go. If the file is large, or several requests read several files at the same time, server memory can be exhausted outright. Since our customers have images of several hundred MB, a spike in concurrency could easily blow up memory, so this had to be changed to a streaming io.Copy.
After locating the problem, we switched to streaming the response back to the frontend:
_, err = io.Copy(ctx.ResponseWriter(), file)
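A minimal sketch of the same fix in a plain net/http handler (ctx.ResponseWriter() above comes from the project's web framework; the handler name, route and file path here are hypothetical):

package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func serveImage(w http.ResponseWriter, r *http.Request) {
	f, err := os.Open("/data/images/large.jpg") // hypothetical path
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	defer f.Close()

	// io.Copy streams through a small fixed-size buffer instead of loading
	// the whole file into memory the way ioutil.ReadAll does.
	if _, err := io.Copy(w, f); err != nil {
		log.Println("copy failed:", err)
	}
}

func main() {
	http.HandleFunc("/image", serveImage)
	log.Fatal(http.ListenAndServe(":8080", nil))
}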
cpu profile
http://host:port/debug/pprof/profile?seconds=30
This command specifies that pprof should collect 30 seconds of profiling data from the running application
Then analyze the resulting file with go tool pprof.
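Alternatively, go tool pprof can be pointed at the endpoint directly; it fetches the 30-second profile and then drops into the interactive shell (host:port is a placeholder):
go tool pprof "http://host:port/debug/pprof/profile?seconds=30"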
visualization
- Nodes: Represent functions or methods.
- Edges: Indicate the invocation between these functions.
- Colors: Denote the amount of time or memory consumed.
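Such a graph can be generated locally with pprof's built-in web UI (the graph view requires Graphviz to be installed; the port and file name here are arbitrary):
go tool pprof -http=:8080 heap.out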
Source code
The init function of the net/http/pprof package registers these routes:
func init() {
	http.HandleFunc("/debug/pprof/", Index)
	http.HandleFunc("/debug/pprof/cmdline", Cmdline)
	http.HandleFunc("/debug/pprof/profile", Profile) // CPU profile
	http.HandleFunc("/debug/pprof/symbol", Symbol)
	http.HandleFunc("/debug/pprof/trace", Trace)
}
The following profiles are supported under /debug/pprof/:
profiles.m = map[string]*Profile{
	"goroutine":    goroutineProfile,
	"threadcreate": threadcreateProfile,
	"heap":         heapProfile,
	"allocs":       allocsProfile,
	"block":        blockProfile,
	"mutex":        mutexProfile,
}
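These named profiles can also be dumped programmatically through runtime/pprof.Lookup, without going through the HTTP handlers. A minimal sketch:

package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	// Write the goroutine profile of the current process to stdout;
	// debug=1 selects the human-readable text format.
	pprof.Lookup("goroutine").WriteTo(os.Stdout, 1)
}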
trace
When the runtime itself becomes the bottleneck, for example due to goroutine scheduling latency or excessively long GC STW pauses, trace helps us inspect runtime details.
curl host/debug/pprof/trace?seconds=10 > trace.out
Here we capture 10 seconds of data and then open it with
go tool trace trace.out
If the data volume is large we may have to wait a while; a new browser tab then opens with very useful data: GC times, the impact of STW, function call stacks, and goroutine scheduling.
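A trace can also be produced without the HTTP endpoint by using the runtime/trace package directly. A minimal sketch (the file name trace.out is arbitrary):

package main

import (
	"log"
	"os"
	"runtime/trace"
)

func main() {
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	// ... the workload to trace ...
}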
Source code
// A Profile is a collection of stack traces showing the call sequences
// that led to instances of a particular event, such as allocation.
// Packages can create and maintain their own profiles; the most common
// use is for tracking resources that must be explicitly closed, such as files
// or network connections.
//
// A Profile's methods can be called from multiple goroutines simultaneously.
//
// Each Profile has a unique name. A few profiles are predefined:
//
// goroutine - stack traces of all current goroutines
// heap - a sampling of memory allocations of live objects
// allocs - a sampling of all past memory allocations
// threadcreate - stack traces that led to the creation of new OS threads
// block - stack traces that led to blocking on synchronization primitives
// mutex - stack traces of holders of contended mutexes
// The heap profile tracks both the allocation sites for all live objects in
// the application memory and for all objects allocated since the program start.
// Pprof's -inuse_space, -inuse_objects, -alloc_space, and -alloc_objects
// flags select which to display, defaulting to -inuse_space (live objects,
// scaled by size).
// The allocs profile is the same as the heap profile but changes the default
// pprof display to -alloc_space, the total number of bytes allocated since
// the program began (including garbage-collected bytes).
// writeHeap writes the current runtime heap profile to w.
func writeHeap(w io.Writer, debug int) error {
	return writeHeapInternal(w, debug, "")
}

// writeAlloc writes the current runtime heap profile to w
// with the total allocation space as the default sample type.
func writeAlloc(w io.Writer, debug int) error {
	return writeHeapInternal(w, debug, "alloc_space")
}
var (
	mbuckets *bucket // memory profile buckets
	bbuckets *bucket // blocking profile buckets
	xbuckets *bucket // mutex profile buckets
	...
)
// A bucket holds per-call-stack profiling information.
// The representation is a bit sleazy, inherited from C.
// This struct defines the bucket header. It is followed in
// memory by the stack words and then the actual record
// data, either a memRecord or a blockRecord.
//
// Per-call-stack profiling information.
// Lookup by hashing call stack into a linked-list hash table.
//
// No heap pointers.
//
//go:notinheap
type bucket struct {
	next    *bucket
	allnext *bucket
	typ     bucketType // memBucket or blockBucket (includes mutexProfile)
	hash    uintptr
	size    uintptr
	nstk    uintptr
}
type memRecord struct {
// The following complex 3-stage scheme of stats accumulation
// is required to obtain a consistent picture of mallocs and frees
// for some point in time.
// The problem is that mallocs come in real time, while frees
// come only after a GC during concurrent sweeping. So if we would
// naively count them, we would get a skew toward mallocs.
//
// Hence, we delay information to get consistent snapshots as
// of mark termination. Allocations count toward the next mark
// termination's snapshot, while sweep frees count toward the
// previous mark termination's snapshot:
//
//              MT          MT          MT          MT
//             .·|         .·|         .·|         .·|
//          .·˙  |      .·˙  |      .·˙  |      .·˙  |
//       .·˙     |   .·˙     |   .·˙     |   .·˙     |
//    .·˙        |.·˙        |.·˙        |.·˙        |
//
//       alloc → ▲ ← free
//               ┠┅┅┅┅┅┅┅┅┅┅┅P
//       C+2            C+1            C
//
//                   alloc → ▲ ← free
//                           ┠┅┅┅┅┅┅┅┅┅┅┅P
//                       C+2            C+1         C
//
// Since we can't publish a consistent snapshot until all of
// the sweep frees are accounted for, we wait until the next
// mark termination ("MT" above) to publish the previous mark
// termination's snapshot ("P" above). To do this, allocation
// and free events are accounted to *future* heap profile
// cycles ("C+n" above) and we only publish a cycle once all
// of the events from that cycle must be done. Specifically:
//
// Mallocs are accounted to cycle C+2.
// Explicit frees are accounted to cycle C+2.
// GC frees (done during sweeping) are accounted to cycle C+1.
//
// After mark termination, we increment the global heap
// profile cycle counter and accumulate the stats from cycle C
// into the active profile.
// active is the currently published profile. A profiling
// cycle can be accumulated into active once its complete.
active memRecordCycle
// future records the profile events we're counting for cycles
// that have not yet been published. This is ring buffer
// indexed by the global heap profile cycle C and stores
// cycles C, C+1, and C+2. Unlike active, these counts are
// only for a single cycle; they are not cumulative across
// cycles.
//
// We store cycle C here because there's a window between when
// C becomes the active cycle and when we've flushed it to
// active.
future [3]memRecordCycle
}
Some key points:
// Mallocs are accounted to cycle C+2.
// Explicit frees are accounted to cycle C+2.
// GC frees (done during sweeping) are accounted to cycle C+1.
...
// After mark termination, we increment the global heap
// profile cycle counter and accumulate the stats from cycle C
// into the active profile.
// memRecordCycle
type memRecordCycle struct {
	allocs, frees           uintptr
	alloc_bytes, free_bytes uintptr
}

// A blockRecord is the bucket data for a bucket of type blockProfile,
// which is used in blocking and mutex profiles.
type blockRecord struct {
	count  float64
	cycles int64
}
writeHeapInternal
When debug != 0, it calls runtime.ReadMemStats(memStats), which briefly stops the world (STW) to read memstats.
During memory allocation, the runtime samples according to a certain policy and records the samples into mbuckets so users can analyze them; the sampling algorithm has an important dependency, MemProfileRate.
// MemProfileRate controls the fraction of memory allocations
// that are recorded and reported in the memory profile.
// The profiler aims to sample an average of
// one allocation per MemProfileRate bytes allocated.
The default value is 512 KB and can be configured by the user.
goroutine
// writeGoroutine writes the current runtime GoroutineProfile to w.
func writeGoroutine(w io.Writer, debug int) error {
	if debug >= 2 {
		return writeGoroutineStacks(w)
	}
	return writeRuntimeProfile(w, debug, "goroutine", runtime_goroutineProfileWithLabels)
}
This path also calls stopTheWorld("profile").
threadcreate
The implementation of pprof/threadcreate is similar to pprof/goroutine, except that the former iterates over the global allm while the latter iterates over allgs; the difference is that pprof/threadcreate => ThreadCreateProfile does not stop the world.
mutex
Mutex sampling is disabled by default; it is enabled or disabled by configuring the rate via runtime.SetMutexProfileFraction(int).
Similar to the mbuckets analyzed above, the sampling data here is recorded in xbuckets; each bucket records the stack of the lock holder, the (sampled) count, and other information for the user to inspect.
mutexevent is called in semrelease1.
block
Block events are recorded in chan send/recv, select, semacquire1 and notifyListWait.
Security:
pprof can be used in production, but the endpoints must not be exposed directly; operations such as STW are involved, so there are potential risks.
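A minimal sketch of one common mitigation, assuming the deployment can bind an extra loopback-only port: register the pprof handlers on a separate internal mux instead of exposing them on the public listener (the handler names come from net/http/pprof; the ports are arbitrary):

package main

import (
	"log"
	"net/http"
	"net/http/pprof" // note: its init() still registers handlers on http.DefaultServeMux
)

func main() {
	// Public listener: business routes only, on its own mux (not DefaultServeMux),
	// so the handlers registered by the pprof init() are not reachable here.
	public := http.NewServeMux()
	public.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	go func() { log.Fatal(http.ListenAndServe(":8080", public)) }()

	// Internal listener: pprof bound to loopback only.
	internal := http.NewServeMux()
	internal.HandleFunc("/debug/pprof/", pprof.Index)
	internal.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	internal.HandleFunc("/debug/pprof/profile", pprof.Profile)
	internal.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	internal.HandleFunc("/debug/pprof/trace", pprof.Trace)
	log.Fatal(http.ListenAndServe("127.0.0.1:6060", internal))
}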