pprof

Viewing the currently active goroutines

  • Count: runtime.NumGoroutine()
  • Details: dump the stack trace of every goroutine (a small sketch follows the link below)

www.cnblogs.com/lanyangsh/p…
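
A minimal sketch showing both approaches, using runtime.NumGoroutine for the count and pprof.Lookup("goroutine") for the per-goroutine stacks:

package main

import (
        "fmt"
        "os"
        "runtime"
        "runtime/pprof"
)

func main() {
        // Number of currently active goroutines.
        fmt.Println("goroutines:", runtime.NumGoroutine())

        // Stack traces of all current goroutines; debug=1 groups
        // goroutines with identical stacks and prints them as text.
        pprof.Lookup("goroutine").WriteTo(os.Stdout, 1)
}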

Import the pprof package in your program:

import _ "net/http/pprof"

Start an HTTP server in the program:


package main

import (
        "log"
        "net/http"
        _ "net/http/pprof"
)

func main() {

        go func() {
                log.Println(http.ListenAndServe("localhost:6060", nil))
        }()

        select{}
}

Visit http://127.0.0.1:6060/debug/pprof/goroutine?debug=1

pprof

zhuanlan.zhihu.com/p/506035291

pprof originated from Google Perf Tools and has been part of Go since early versions. pprof consists of two parts:

  • runtime/pprof - automatically compiled into every Go program; it collects the profiling data
  • go tool pprof - used to investigate and analyze profile files

Profile types

  • cpu - where the program spends its time while actively consuming CPU cycles
  • goroutine - stack traces of all current goroutines
  • threadcreate - stack traces that led to the creation of new OS threads
  • heap - a sampling of memory allocations of live objects
  • allocs - a sampling of all past memory allocations
  • block - stack traces that led to blocking on synchronization primitives
  • mutex - stack traces of holders of contended mutexes

By default, block and mutex information is not tracked. To see these two profiles, add two lines to your code (a combined sketch follows):

runtime.SetBlockProfileRate(1) // enable tracking of blocking operations (block profile)
runtime.SetMutexProfileFraction(1) // enable tracking of mutex contention (mutex profile)
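
A minimal sketch that enables both profiles at startup, reusing the HTTP pprof server from above:

package main

import (
        "log"
        "net/http"
        _ "net/http/pprof"
        "runtime"
)

func main() {
        // Sample every blocking event and every mutex contention event.
        // In production, a larger rate reduces the sampling overhead.
        runtime.SetBlockProfileRate(1)
        runtime.SetMutexProfileFraction(1)

        log.Println(http.ListenAndServe("localhost:6060", nil))
}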

CPU profiling

When CPU profiling is enabled, the runtime interrupts itself n times per second (100 by default) and records the stack traces of the running goroutines.

Once the profile is finished, we can use it to analyze the hottest code paths.

The more often a function appears in the profile, the larger its share of the total running time.
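
For programs that are not long-running servers, CPU profiling can also be driven from code via runtime/pprof. A minimal sketch (the file name cpu.prof and the doWork workload are illustrative):

package main

import (
        "log"
        "os"
        "runtime/pprof"
)

func main() {
        f, err := os.Create("cpu.prof")
        if err != nil {
                log.Fatal(err)
        }
        defer f.Close()

        // Start sampling; the runtime interrupts the process ~100 times per
        // second and records the current stack traces until StopCPUProfile.
        if err := pprof.StartCPUProfile(f); err != nil {
                log.Fatal(err)
        }
        defer pprof.StopCPUProfile()

        doWork()
}

func doWork() {
        for i := 0; i < 100000000; i++ {
                _ = i * i
        }
}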

Implementation

// setcpuprofilerate sets the CPU profiling rate to hz times per second.
// If hz <= 0, setcpuprofilerate turns off CPU profiling.
func setcpuprofilerate(hz int32) {
   ...
}

func setProcessCPUProfiler(hz int32) {
   setProcessCPUProfilerTimer(hz)
}

At every interval (determined by the rate), a SIGPROF signal is sent to the thread on which the current g is running.

Because of the GMP model, in the vast majority of cases this timer + signal + current-thread approach, combined with the preemptive scheduling supported today, samples a runtime-wide CPU profile well. Some edge cases still cannot be covered, since the sampling is only based on the current M.

Memory Profiling

Memory profiling records the stack trace at the time of a heap allocation.

Like CPU profiling, memory profiling is sampling-based. The default MemProfileRate is 512 KB and can be configured by the user.
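
A minimal sketch of writing a heap profile to a file with runtime/pprof (the file name mem.prof and the allocate workload are illustrative):

package main

import (
        "log"
        "os"
        "runtime"
        "runtime/pprof"
)

func main() {
        // Optional: sample more aggressively than the 512 KB default.
        // Set this as early as possible, before the allocations of interest.
        runtime.MemProfileRate = 64 * 1024

        data := allocate()

        f, err := os.Create("mem.prof")
        if err != nil {
                log.Fatal(err)
        }
        defer f.Close()

        runtime.GC() // flush recent allocation/free events into the profile
        if err := pprof.WriteHeapProfile(f); err != nil {
                log.Fatal(err)
        }
        _ = data
}

func allocate() [][]byte {
        buf := make([][]byte, 0, 1000)
        for i := 0; i < 1000; i++ {
                buf = append(buf, make([]byte, 64*1024))
        }
        return buf
}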

Block Profiling

Block profiling records how long a goroutine spends waiting on a shared resource. It can reveal whether a large number of goroutines could be doing work but are blocked instead (on channels, select, or mutexes).

Samples are recorded in chan send/recv, select, semacquire1 and notifyListWait.

Mutex Profiling

Mutex profiling is similar to block profiling, but it focuses on the latency introduced by exclusive operations under mutex contention.

Profiling a Go program 

Do only one kind of profiling at a time.

go test

Enable any one of the following flags to generate the corresponding profile file. Be careful when enabling several flags at once, because one kind of profiling can distort the results of the others.

go test -cpuprofile cpu.prof

go test -memprofile mem.prof

go test -blockprofile block.prof

go test -mutexprofile mutex.prof

Once the sample data has been collected, we can analyze it with pprof. This tool ships with the Go toolchain, but it is not an everyday tool; it is invoked as go tool pprof. The command has many features and options, but its most basic usage takes two arguments: the executable that produced the profile and the profile data.
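
For example (the test binary name is hypothetical; go test writes it next to the profile):

go tool pprof ./mypkg.test cpu.prof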

func TestBasic(t *testing.T) {
   for i := 0; i < 200; i++ { // running it many times
      if Fib2(30) != 832040 {
         t.Error("Incorrect!")
      }
   }
}
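
Fib2 itself is not shown in the source; presumably it is a naive recursive Fibonacci, which makes it a convenient CPU-heavy workload. A sketch:

// Fib2 is assumed: a naive recursive Fibonacci implementation.
func Fib2(n int) int {
        if n < 2 {
                return n
        }
        return Fib2(n-1) + Fib2(n-2)
}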

In the directory containing the test file, run:

go test -run TestBasic -cpuprofile cpu.prof

go tool pprof cpu.prof

Type top5 -cum

The topN command shows the top N entries; the -cum flag sorts them by cumulative time.

HTTP interface

There is also a standard HTTP interface to profiling data. Adding the following line will install handlers under the /debug/pprof/ URL to download live profiles:

import _ "net/http/pprof"

Visit http://host:port/debug/pprof

You can also work with it directly: go tool pprof http://host:port/debug/pprof/heap

go tool pprof -png http://host:port/debug/pprof/heap > heap.png // generate a PNG file

Exporting to a file first is the recommended approach: the steps are more tedious, but having a file on disk ensures that important data is not lost.

// export to a file
curl -o heap.out http://IP:Port/debug/pprof/heap

// parse the file and enter the interactive command line
go tool pprof heap.out

// from here on, the workflow is the same as direct command-line analysis
// top 10
// list funcName

Example: locating a memory problem

blog.csdn.net/weixin_4556…

curl -o heap.out http://IP:Port/debug/pprof/heap

go tool pprof heap.out

top 10 // show the top 10 functions by heap memory usage

  • flat: memory allocated by this function and still held by it.
  • cum: total memory allocated by this function or by functions below it in the call stack.

list main.testReadAll // show line-by-line detail for the function

This finally pointed to ioutil.ReadAll as the method using too much memory.

As anyone familiar with it knows, ioutil.ReadAll reads a file or stream into memory in one shot. If the file is large, or several requests read several files at the same time, it can exhaust the server's memory. Since our customers have images of several hundred MB, memory could easily blow up once concurrency went up, so this needed to be changed to a streaming io.Copy.

After locating the problem, we switched to streaming the response back to the client:

_, err = io.Copy(ctx.ResponseWriter(), file)
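
A minimal before/after sketch, rewritten here with the standard net/http types (the original uses a framework's ctx.ResponseWriter(); the file path and handler names are illustrative):

package main

import (
        "io"
        "io/ioutil"
        "log"
        "net/http"
        "os"
)

// downloadBad reads the whole file into memory at once; with large files
// and concurrent requests this is what exhausts the heap.
func downloadBad(w http.ResponseWriter, r *http.Request) {
        f, err := os.Open("big-image.jpg")
        if err != nil {
                http.Error(w, err.Error(), http.StatusInternalServerError)
                return
        }
        defer f.Close()

        data, err := ioutil.ReadAll(f) // entire file held in memory
        if err != nil {
                http.Error(w, err.Error(), http.StatusInternalServerError)
                return
        }
        w.Write(data)
}

// downloadGood streams the file to the client in small chunks.
func downloadGood(w http.ResponseWriter, r *http.Request) {
        f, err := os.Open("big-image.jpg")
        if err != nil {
                http.Error(w, err.Error(), http.StatusInternalServerError)
                return
        }
        defer f.Close()

        if _, err := io.Copy(w, f); err != nil {
                log.Println("copy:", err)
        }
}

func main() {
        http.HandleFunc("/bad", downloadBad)
        http.HandleFunc("/good", downloadGood)
        log.Println(http.ListenAndServe(":8080", nil))
}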

cpu profile

http://host:port/debug/pprof/profile?seconds=30 

This command specifies that pprof should collect 30 seconds of profiling data from the running application

Then analyze the file with go tool pprof.
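
For example (host and port are placeholders):

curl -o cpu.out "http://host:port/debug/pprof/profile?seconds=30"
go tool pprof cpu.out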

visualization

  • Nodes: Represent functions or methods.
  • Edges: Indicate the invocation between these functions.
  • Colors: Denote the amount of time or memory consumed.
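
To render this call graph, pprof can open an interactive web UI (the graph view needs Graphviz installed; the port is arbitrary):

go tool pprof -http=:8081 cpu.prof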

Source code

The init function of the net/http/pprof package registers these routes:

func init() {
   http.HandleFunc("/debug/pprof/", Index)
   http.HandleFunc("/debug/pprof/cmdline", Cmdline)
   http.HandleFunc("/debug/pprof/profile", Profile)//cpu profile
   http.HandleFunc("/debug/pprof/symbol", Symbol)
   http.HandleFunc("/debug/pprof/trace", Trace)
}

Under /debug/pprof/, these profiles are supported:

profiles.m = map[string]*Profile{
   "goroutine":    goroutineProfile,
   "threadcreate": threadcreateProfile,
   "heap":         heapProfile,
   "allocs":       allocsProfile,
   "block":        blockProfile,
   "mutex":        mutexProfile,
}
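
If the application uses its own ServeMux instead of http.DefaultServeMux, the same handlers can be registered manually. A sketch:

package main

import (
        "log"
        "net/http"
        "net/http/pprof"
)

func main() {
        mux := http.NewServeMux()

        // Mirror what the package's init() does on DefaultServeMux.
        mux.HandleFunc("/debug/pprof/", pprof.Index)
        mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
        mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
        mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
        mux.HandleFunc("/debug/pprof/trace", pprof.Trace)

        log.Println(http.ListenAndServe("localhost:6060", mux))
}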

trace

When the runtime itself is the bottleneck, for example goroutine scheduling latency or an overly long GC STW, trace can help us inspect the runtime details.

curl host/debug/pprof/trace?seconds=10 > trace.out

Here we capture 10 seconds of data, then open it with:

go tool trace trace.out 

If the data set is large we may need to wait a while; a new browser tab then opens with very useful information: GC timing, the impact of STW, function call stacks, and goroutine scheduling.
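
Traces can also be captured programmatically with runtime/trace. A minimal sketch (the output file name and the doWork workload are illustrative):

package main

import (
        "log"
        "os"
        "runtime/trace"
)

func main() {
        f, err := os.Create("trace.out")
        if err != nil {
                log.Fatal(err)
        }
        defer f.Close()

        if err := trace.Start(f); err != nil {
                log.Fatal(err)
        }
        defer trace.Stop()

        doWork()
}

func doWork() {
        ch := make(chan int)
        go func() { ch <- 42 }()
        <-ch
}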

Source code

// A Profile is a collection of stack traces showing the call sequences
// that led to instances of a particular event, such as allocation.
// Packages can create and maintain their own profiles; the most common
// use is for tracking resources that must be explicitly closed, such as files
// or network connections.
//
// A Profile's methods can be called from multiple goroutines simultaneously.
//
// Each Profile has a unique name. A few profiles are predefined:
//
// goroutine    - stack traces of all current goroutines
// heap         - a sampling of memory allocations of live objects
// allocs       - a sampling of all past memory allocations
// threadcreate - stack traces that led to the creation of new OS threads
// block        - stack traces that led to blocking on synchronization primitives
// mutex        - stack traces of holders of contended mutexes
// The heap profile tracks both the allocation sites for all live objects in
// the application memory and for all objects allocated since the program start.
// Pprof's -inuse_space, -inuse_objects, -alloc_space, and -alloc_objects
// flags select which to display, defaulting to -inuse_space (live objects,
// scaled by size).
// The allocs profile is the same as the heap profile but changes the default
// pprof display to -alloc_space, the total number of bytes allocated since
// the program began (including garbage-collected bytes).
// writeHeap writes the current runtime heap profile to w.
func writeHeap(w io.Writer, debug int) error {
   return writeHeapInternal(w, debug, "")
}

// writeAlloc writes the current runtime heap profile to w
// with the total allocation space as the default sample type.
func writeAlloc(w io.Writer, debug int) error {
   return writeHeapInternal(w, debug, "alloc_space")
}
var (
   mbuckets  *bucket // memory profile buckets
   bbuckets  *bucket // blocking profile buckets
   xbuckets  *bucket // mutex profile buckets
   ...
   )
// A bucket holds per-call-stack profiling information.
// The representation is a bit sleazy, inherited from C.
// This struct defines the bucket header. It is followed in
// memory by the stack words and then the actual record
// data, either a memRecord or a blockRecord.
//
// Per-call-stack profiling information.
// Lookup by hashing call stack into a linked-list hash table.
//
// No heap pointers.
//
//go:notinheap
type bucket struct {
   next    *bucket
   allnext *bucket
   typ     bucketType // memBucket or blockBucket (includes mutexProfile)
   hash    uintptr
   size    uintptr
   nstk    uintptr
}
type memRecord struct {
   // The following complex 3-stage scheme of stats accumulation
   // is required to obtain a consistent picture of mallocs and frees
   // for some point in time.
   // The problem is that mallocs come in real time, while frees
   // come only after a GC during concurrent sweeping. So if we would
   // naively count them, we would get a skew toward mallocs.
   //
   // Hence, we delay information to get consistent snapshots as
   // of mark termination. Allocations count toward the next mark
   // termination's snapshot, while sweep frees count toward the
   // previous mark termination's snapshot:
   //
   //              MT          MT          MT          MT
   //             .·|         .·|         .·|         .·|
   //          .·˙  |      .·˙  |      .·˙  |      .·˙  |
   //       .·˙     |   .·˙     |   .·˙     |   .·˙     |
   //    .·˙        |.·˙        |.·˙        |.·˙        |
   //
   //       alloc → ▲ ← free
   //               ┠┅┅┅┅┅┅┅┅┅┅┅P
   //       C+2     →    C+1    →  C
   //
   //                   alloc → ▲ ← free
   //                           ┠┅┅┅┅┅┅┅┅┅┅┅P
   //                   C+2     →    C+1    →  C
   //
   // Since we can't publish a consistent snapshot until all of
   // the sweep frees are accounted for, we wait until the next
   // mark termination ("MT" above) to publish the previous mark
   // termination's snapshot ("P" above). To do this, allocation
   // and free events are accounted to *future* heap profile
   // cycles ("C+n" above) and we only publish a cycle once all
   // of the events from that cycle must be done. Specifically:
   //
   // Mallocs are accounted to cycle C+2.
   // Explicit frees are accounted to cycle C+2.
   // GC frees (done during sweeping) are accounted to cycle C+1.
   //
   // After mark termination, we increment the global heap
   // profile cycle counter and accumulate the stats from cycle C
   // into the active profile.

   // active is the currently published profile. A profiling
   // cycle can be accumulated into active once its complete.
   active memRecordCycle

   // future records the profile events we're counting for cycles
   // that have not yet been published. This is ring buffer
   // indexed by the global heap profile cycle C and stores
   // cycles C, C+1, and C+2. Unlike active, these counts are
   // only for a single cycle; they are not cumulative across
   // cycles.
   //
   // We store cycle C here because there's a window between when
   // C becomes the active cycle and when we've flushed it to
   // active.
   future [3]memRecordCycle
}

Some important points:

// Mallocs are accounted to cycle C+2.
// Explicit frees are accounted to cycle C+2.
// GC frees (done during sweeping) are accounted to cycle C+1.

...
// After mark termination, we increment the global heap
// profile cycle counter and accumulate the stats from cycle C
// into the active profile.
// memRecordCycle
type memRecordCycle struct {
   allocs, frees           uintptr
   alloc_bytes, free_bytes uintptr
}

writeHeapInternal

When debug != 0, it calls runtime.ReadMemStats(memStats) // a brief STW to read memstats

When allocating memory, the runtime samples according to a certain policy and records the samples into mbuckets so they can be analyzed later; the sampling algorithm has an important dependency, MemProfileRate:

// MemProfileRate controls the fraction of memory allocations 
// that are recorded and reported in the memory profile. 
// The profiler aims to sample an average of 
// one allocation per MemProfileRate bytes allocated. 

Its default value is 512 KB and can be configured by the user.

goroutine

// writeGoroutine writes the current runtime GoroutineProfile to w.
func writeGoroutine(w io.Writer, debug int) error {
   if debug >= 2 {
      return writeGoroutineStacks(w)
   }
   return writeRuntimeProfile(w, debug, "goroutine", runtime_goroutineProfileWithLabels)
}

It also calls stopTheWorld("profile").

threadcreate

The implementation of pprof/threadcreate is similar to pprof/goroutine, except that the former iterates over the global allm while the latter iterates over allgs; the other difference is that pprof/threadcreate => ThreadCreateProfile does not perform a STW.

mutex

Mutex sampling is disabled by default; it is enabled or disabled by configuring the rate via runtime.SetMutexProfileFraction(int).

Similar to the mbuckets analyzed above, the sampled data here is recorded in xbuckets; each bucket records the lock holder's stack, the (sampled) count, and other information for the user to inspect.

// A blockRecord is the bucket data for a bucket of type blockProfile,
// which is used in blocking and mutex profiles.
type blockRecord struct {
   count  float64
   cycles int64
}

mutexevent is called from semrelease1.

block

Samples are recorded in chan send/recv, select, semacquire1 and notifyListWait.

Security:

pprof can be used in production, but the endpoints must not be exposed directly: they involve operations such as STW and are therefore a potential risk.
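
A common mitigation is to keep the pprof handlers off the public listener entirely. A sketch (addresses are illustrative):

package main

import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers handlers only on http.DefaultServeMux
)

func main() {
        // Public API: its own mux, so /debug/pprof/ is NOT reachable here.
        api := http.NewServeMux()
        api.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
                w.Write([]byte("ok"))
        })
        go func() {
                log.Println(http.ListenAndServe(":8080", api))
        }()

        // pprof: DefaultServeMux, bound to loopback (or an internal interface) only.
        log.Println(http.ListenAndServe("127.0.0.1:6060", nil))
}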