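Introduction
With the spread of Kubernetes and container technology, Go is now widely used not only for cloud-native infrastructure components but also across all kinds of business workloads, and a growing number of new services choose Golang as their primary language. Thanks to a rich set of web and RPC frameworks (such as Gin, Kratos, and Kitex), Golang's microservice ecosystem keeps maturing, and the language underpins many important open-source projects, including OpenTelemetry Collector, etcd, Prometheus, and Istio.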
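Compared with Java, however, Golang is still at a disadvantage in its microservice ecosystem. Java can use bytecode enhancement to provide non-intrusive application monitoring, while Golang has no mature equivalent. Today most monitoring for Golang applications is integrated through an SDK, such as the OTel SDK, which requires developers to add instrumentation by hand. Manual instrumentation has the following problems:
- Tracing: every call site has to be instrumented, and the trace context must be propagated carefully so that spans are not linked incorrectly.
- Metrics: every call has to be counted, and care is needed to avoid metric cardinality explosion.
- The workload is heavy and intrusive to business code: every new interface needs matching instrumentation.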
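The observability Go Agent was created to solve these problems.
Observability Agent Architecture
Java can instrument code non-intrusively through the bytecode enhancement that the JVM supports. Golang has no comparable capability, so we use compile-time injection instead and insert the instrumentation during compilation. The architecture is as follows:
Readers who know the Golang build process will find this familiar. Compiling a Go application roughly proceeds as follows: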
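- Create a temporary directory, with a command similar to
  mkdir -p $WORK/b088/
- Resolve dependency information, with a command similar to
  cat >/var/folders/7c/xvg9tyv929d1mqbd44ygh8400000gp/T/go-build2899987616/b104/importcfg << 'EOF' # internal
  The importcfg file contains all of the dependency information.
- Run compile to produce the object file xxx.a.
- Generate the configuration file that link needs, then run link to turn the object files above into an executable.
- Move the executable to the current directory and delete the temporary directory.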
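Go also provides the -toolexec flag, which names a program used to invoke the toolchain tools. That program can therefore step into the build while go build runs, as shown below: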
go build -toolexec xxx
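To make the mechanism concrete, here is a minimal sketch of what a -toolexec wrapper could look like. With -toolexec, the go command runs the wrapper with the real tool's path followed by that tool's arguments. The helper names (toolName, isCompile) and the logging hook are illustrative assumptions and do not describe Go Agent's actual implementation.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strings"
)

// toolName returns the base name of the wrapped tool,
// without a platform-specific ".exe" suffix.
func toolName(path string) string {
	return strings.TrimSuffix(filepath.Base(path), ".exe")
}

// isCompile reports whether the wrapped invocation is the Go compiler,
// the step at which an agent could rewrite source files.
func isCompile(args []string) bool {
	return len(args) > 0 && toolName(args[0]) == "compile"
}

func main() {
	// The go command runs: <wrapper> <tool path> <tool args...>
	args := os.Args[1:]
	if len(args) == 0 {
		fmt.Fprintln(os.Stderr, "usage: go build -toolexec /abs/path/to/wrapper")
		return
	}
	if isCompile(args) {
		// Hypothetical hook: source files could be inspected or rewritten here.
		fmt.Fprintf(os.Stderr, "[wrapper] compile invoked with %d args\n", len(args)-1)
	}
	// Always run the real tool so the build continues normally.
	cmd := exec.Command(args[0], args[1:]...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		if ee, ok := err.(*exec.ExitError); ok {
			os.Exit(ee.ExitCode())
		}
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Once built, such a wrapper would be passed to go build -toolexec by its absolute path. Forwarding every invocation unchanged keeps the build working even for tools the wrapper does not care about.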
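So how does Go Agent inject instrumentation at compile time? We cover this from the following angles:
Locating Instrumentation Points
A microservice usually contains a large amount of code, so how do we find the points where code needs to be inserted? We rely on the abstract syntax tree (AST) and analyze the syntax of every .go file. The AST is produced as follows: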
- The lexer performs lexical analysis on the source file and produces a stream of tokens.
- The parser consumes the tokens and generates the AST.
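Go ships these capabilities in its standard library:
go/scanner: lexical analysis, which splits source code into individual tokens
go/token: token types and the related struct definitions
go/ast: the AST struct definitions
go/parser: syntax analysis, which reads the token stream and builds the AST
The following demo illustrates the AST: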
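As a small illustration of the lexing step, the sketch below uses go/scanner to split one line of Go source into tokens. The tokenize helper and the output format are assumptions made for this example only.

```go
package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

// tokenize splits Go source into tokens and returns them as printable strings.
func tokenize(src []byte) []string {
	fset := token.NewFileSet()
	file := fset.AddFile("demo.go", fset.Base(), len(src))

	var s scanner.Scanner
	s.Init(file, src, nil, 0) // nil error handler; mode 0 skips comments

	var out []string
	for {
		_, tok, lit := s.Scan()
		if tok == token.EOF {
			break
		}
		if lit != "" {
			// Identifiers, literals and keywords carry their source text.
			out = append(out, fmt.Sprintf("%s(%q)", tok, lit))
		} else {
			// Operators and delimiters are fully described by the token kind.
			out = append(out, tok.String())
		}
	}
	return out
}

func main() {
	for _, t := range tokenize([]byte(`var msg = "Hello World!"`)) {
		fmt.Println(t)
	}
}
```

The scanner also inserts an automatic semicolon token at the end of the statement, which is why the token list is slightly longer than the visible source suggests.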
package hello

import "fmt"

func greet() {
    var msg = "Hello World!"
    fmt.Println(msg)
}
The parsed result is as follows:
0 *ast.File {
1 . Package: 2:1
2 . Name: *ast.Ident {
3 . . NamePos: 2:9
4 . . Name: "hello"
5 . }
6 . Decls: []ast.Decl (len = 2) {
7 . . 0: *ast.GenDecl {
8 . . . TokPos: 4:1
9 . . . Tok: import
10 . . . Lparen: -
11 . . . Specs: []ast.Spec (len = 1) {
12 . . . . 0: *ast.ImportSpec {
13 . . . . . Path: *ast.BasicLit {
14 . . . . . . ValuePos: 4:8
15 . . . . . . Kind: STRING
16 . . . . . . Value: "\"fmt\""
17 . . . . . }
18 . . . . . EndPos: -
19 . . . . }
20 . . . }
21 . . . Rparen: -
22 . . }
23 . . 1: *ast.FuncDecl {
24 . . . Name: *ast.Ident {
25 . . . . NamePos: 6:6
26 . . . . Name: "greet"
27 . . . . Obj: *ast.Object {
28 . . . . . Kind: func
29 . . . . . Name: "greet"
30 . . . . . Decl: *(obj @ 23)
31 . . . . }
32 . . . }
33 . . . Type: *ast.FuncType {
34 . . . . Func: 6:1
35 . . . . Params: *ast.FieldList {
36 . . . . . Opening: 6:11
37 . . . . . Closing: 6:12
38 . . . . }
39 . . . }
40 . . . Body: *ast.BlockStmt {
41 . . . . Lbrace: 6:14
42 . . . . List: []ast.Stmt (len = 2) {
43 . . . . . 0: *ast.DeclStmt {
44 . . . . . . Decl: *ast.GenDecl {
45 . . . . . . . TokPos: 7:5
46 . . . . . . . Tok: var
47 . . . . . . . Lparen: -
48 . . . . . . . Specs: []ast.Spec (len = 1) {
49 . . . . . . . . 0: *ast.ValueSpec {
50 . . . . . . . . . Names: []*ast.Ident (len = 1) {
51 . . . . . . . . . . 0: *ast.Ident {
52 . . . . . . . . . . . NamePos: 7:9
53 . . . . . . . . . . . Name: "msg"
54 . . . . . . . . . . . Obj: *ast.Object {
55 . . . . . . . . . . . . Kind: var
56 . . . . . . . . . . . . Name: "msg"
57 . . . . . . . . . . . . Decl: *(obj @ 49)
58 . . . . . . . . . . . . Data: 0
59 . . . . . . . . . . . }
60 . . . . . . . . . . }
61 . . . . . . . . . }
62 . . . . . . . . . Values: []ast.Expr (len = 1) {
63 . . . . . . . . . . 0: *ast.BasicLit {
64 . . . . . . . . . . . ValuePos: 7:15
65 . . . . . . . . . . . Kind: STRING
66 . . . . . . . . . . . Value: "\"Hello World!\""
67 . . . . . . . . . . }
68 . . . . . . . . . }
69 . . . . . . . . }
70 . . . . . . . }
71 . . . . . . . Rparen: -
72 . . . . . . }
73 . . . . . }
74 . . . . . 1: *ast.ExprStmt {
75 . . . . . . X: *ast.CallExpr {
76 . . . . . . . Fun: *ast.SelectorExpr {
77 . . . . . . . . X: *ast.Ident {
78 . . . . . . . . . NamePos: 8:5
79 . . . . . . . . . Name: "fmt"
80 . . . . . . . . }
81 . . . . . . . . Sel: *ast.Ident {
82 . . . . . . . . . NamePos: 8:9
83 . . . . . . . . . Name: "Println"
84 . . . . . . . . }
85 . . . . . . . }
86 . . . . . . . Lparen: 8:16
87 . . . . . . . Args: []ast.Expr (len = 1) {
88 . . . . . . . . 0: *ast.Ident {
89 . . . . . . . . . NamePos: 8:17
90 . . . . . . . . . Name: "msg"
91 . . . . . . . . . Obj: *(obj @ 54)
92 . . . . . . . . }
93 . . . . . . . }
94 . . . . . . . Ellipsis: -
95 . . . . . . . Rparen: 8:20
96 . . . . . . }
97 . . . . . }
98 . . . . }
99 . . . . Rbrace: 9:1
100 . . . }
101 . . }
102 . }
103 . FileStart: 1:1
104 . FileEnd: 9:3
105 . Scope: *ast.Scope {
106 . . Objects: map[string]*ast.Object (len = 1) {
107 . . . "greet": *(obj @ 27)
108 . . }
109 . }
110 . Imports: []*ast.ImportSpec (len = 1) {
111 . . 0: *(obj @ 12)
112 . }
113 . Unresolved: []*ast.Ident (len = 1) {
114 . . 0: *(obj @ 77)
115 . }
116 . GoVersion: ""
117 }
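Here ast.Ident represents an identifier (at the top of the file, the package name); ast.GenDecl represents every declaration other than a function, that is, those introduced by the import, const, var, and type keywords; and ast.FuncDecl represents a function declaration, including its parameters and body.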
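The sketch below shows one way such a tree can be searched for candidate instrumentation points: it parses the demo source with go/parser and walks it with ast.Inspect to collect function declarations. The funcNames helper is a name chosen for this example; calling ast.Print on the parsed file produces a dump in the same format as the one above.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

const demoSrc = `package hello

import "fmt"

func greet() {
	var msg = "Hello World!"
	fmt.Println(msg)
}
`

// funcNames parses src and returns the name of every function declaration.
func funcNames(src string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "hello.go", src, 0)
	if err != nil {
		return nil, err
	}
	var names []string
	ast.Inspect(file, func(n ast.Node) bool {
		if fn, ok := n.(*ast.FuncDecl); ok {
			names = append(names, fn.Name.Name)
		}
		return true // keep descending into child nodes
	})
	return names, nil
}

func main() {
	names, err := funcNames(demoSrc)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(names)
}
```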
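Code Injection
The syntax analysis above reveals how the code of the current Golang service is written. We then modify the resulting syntax tree and add monitoring logic, such as span creation, to it.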
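The Agent provides a code injection framework. Its API is shown below; a rule specifies what to instrument (which SDK, which version range, which function, and which receiver type) and what code runs before and after the instrumented function.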
package api

type InstrumentPriority int

const (
	InstrumentPointDefault   InstrumentPriority = 0
	InstrumentPriorityLow    InstrumentPriority = 0
	InstrumentPriorityMedium InstrumentPriority = 1
	InstrumentPriorityHigh   InstrumentPriority = 2
)

type InstrumentRule struct {
	Version      string             // Version of the rule, e.g. "v1.9.1" ====>[start,end)
	PkgName      string             // Package name, e.g. "gin"
	FullPkgName  string             // Full package name, e.g. "github.com/gin-gonic/gin"
	FuncName     string             // Function name, e.g. "New"
	RecvTypeName string             // Receiver type name, e.g. "*gin.Engine" or "net/http.*Client"
	Priority     InstrumentPriority // Priority of the rule, indicates the order of the rule
	OnEnter      string             // OnEnter callback
	OnExit       string             // OnExit callback
}

type CallContext struct {
	SkipCall bool
	Context  map[string]interface{}
}
For example, the instrumentation rule for go-redis looks like this:
r1 := api.NewRule("github.com/redis/go-redis/v9", "NewFailoverClient", "", "", "afterNewFailOverRedisClient").WithVersion("[9.0.5,9.5.2)")
api.RegisterRule(r1)
Here afterNewFailOverRedisClient is the code we want to insert into the NewFailoverClient function. This API makes it easy to define instrumentation hooks and to extend them.
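A rule's version range is written as a half-open interval such as "[9.0.5,9.5.2)". The sketch below shows one way a dependency version could be checked against such a range; the function names and the restriction to plain major.minor.patch versions are assumptions for illustration, and the Agent's real matching logic may differ.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion turns "v9.0.5" or "9.0.5" into [9 0 5].
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(strings.TrimSpace(v), "v"), ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("want major.minor.patch, got %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("bad version %q: %w", v, err)
		}
		out[i] = n
	}
	return out, nil
}

// less reports whether version a sorts strictly before version b.
func less(a, b [3]int) bool {
	for i := range a {
		if a[i] != b[i] {
			return a[i] < b[i]
		}
	}
	return false
}

// inRange reports whether version lies in a half-open range "[start,end)".
func inRange(version, rng string) (bool, error) {
	if !strings.HasPrefix(rng, "[") || !strings.HasSuffix(rng, ")") {
		return false, fmt.Errorf("range %q is not of the form [start,end)", rng)
	}
	bounds := strings.Split(rng[1:len(rng)-1], ",")
	if len(bounds) != 2 {
		return false, fmt.Errorf("range %q needs exactly two bounds", rng)
	}
	start, err := parseVersion(bounds[0])
	if err != nil {
		return false, err
	}
	end, err := parseVersion(bounds[1])
	if err != nil {
		return false, err
	}
	v, err := parseVersion(version)
	if err != nil {
		return false, err
	}
	// start is inclusive, end is exclusive.
	return !less(v, start) && less(v, end), nil
}

func main() {
	for _, v := range []string{"v9.0.5", "v9.3.0", "v9.5.2"} {
		ok, err := inRange(v, "[9.0.5,9.5.2)")
		fmt.Println(v, ok, err)
	}
}
```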
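Hybrid Compilation
After the instrumentation points are located and the instrumentation code has been inserted through the API, the hybrid compilation stage begins: the inserted code is compiled together with the existing code, and the build produces the final binary. Go Agent compiles the code as follows: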
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 ./aliyun-go-agent
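Observability: Trace and Metrics Capabilities
Having covered the compile and injection flow, we now turn to the Trace and Metrics capabilities that Go Agent injects. Trace and Metrics are the two most important parts of observability (Profiling will be added later) and are essential for monitoring application stability.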
Trace
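Trace Instrumentation
Trace means distributed tracing: for a single request, one trace lets you find every interface that was called, along with data such as latency, as shown below:
In Go Agent, tracer.Start() is called at the beginning of each instrumented call: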
ctx, span := tracer.Start(req.Context(), req.URL.Path, opts...)
When the instrumented call finishes, span.End() is called:
span.End()
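Trace Context Propagation
How do we make sure the trace context is not lost between different instrumentation points in the same application? We add a goroutine-local (TLS-style) context variable to the goroutine. A goroutine is described by the following struct: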
type g struct {
// Stack parameters.
// stack describes the actual stack memory: [stack.lo, stack.hi).
// stackguard0 is the stack pointer compared in the Go stack growth prologue.
// It is stack.lo+StackGuard normally, but can be StackPreempt to trigger a preemption.
// stackguard1 is the stack pointer compared in the //go:systemstack stack growth prologue.
// It is stack.lo+StackGuard on g0 and gsignal stacks.
// It is ~0 on other goroutine stacks, to trigger a call to morestackc (and crash).
stack stack // offset known to runtime/cgo
stackguard0 uintptr // offset known to liblink
stackguard1 uintptr // offset known to liblink
_panic *_panic // innermost panic - offset known to liblink
_defer *_defer // innermost defer
m *m // current m; offset known to arm liblink
sched gobuf
syscallsp uintptr // if status==Gsyscall, syscallsp = sched.sp to use during gc
syscallpc uintptr // if status==Gsyscall, syscallpc = sched.pc to use during gc
stktopsp uintptr // expected sp at top of stack, to check in traceback
// param is a generic pointer parameter field used to pass
// values in particular contexts where other storage for the
// parameter would be difficult to find. It is currently used
// in four ways:
// 1. When a channel operation wakes up a blocked goroutine, it sets param to
// point to the sudog of the completed blocking operation.
// 2. By gcAssistAlloc1 to signal back to its caller that the goroutine completed
// the GC cycle. It is unsafe to do so in any other way, because the goroutine's
// stack may have moved in the meantime.
// 3. By debugCallWrap to pass parameters to a new goroutine because allocating a
// closure in the runtime is forbidden.
// 4. When a panic is recovered and control returns to the respective frame,
// param may point to a savedOpenDeferState.
param unsafe.Pointer
atomicstatus atomic.Uint32
stackLock uint32 // sigprof/scang lock; TODO: fold in to atomicstatus
goid uint64
schedlink guintptr
waitsince int64 // approx time when the g become blocked
waitreason waitReason // if status==Gwaiting
preempt bool // preemption signal, duplicates stackguard0 = stackpreempt
preemptStop bool // transition to _Gpreempted on preemption; otherwise, just deschedule
preemptShrink bool // shrink stack at synchronous safe point
// asyncSafePoint is set if g is stopped at an asynchronous
// safe point. This means there are frames on the stack
// without precise pointer information.
asyncSafePoint bool
paniconfault bool // panic (instead of crash) on unexpected fault address
gcscandone bool // g has scanned stack; protected by _Gscan bit in status
throwsplit bool // must not split stack
// activeStackChans indicates that there are unlocked channels
// pointing into this goroutine's stack. If true, stack
// copying needs to acquire channel locks to protect these
// areas of the stack.
activeStackChans bool
// parkingOnChan indicates that the goroutine is about to
// park on a chansend or chanrecv. Used to signal an unsafe point
// for stack shrinking.
parkingOnChan atomic.Bool
// inMarkAssist indicates whether the goroutine is in mark assist.
// Used by the execution tracer.
inMarkAssist bool
coroexit bool // argument to coroswitch_m
raceignore int8 // ignore race detection events
nocgocallback bool // whether disable callback from C
tracking bool // whether we're tracking this G for sched latency statistics
trackingSeq uint8 // used to decide whether to track this G
trackingStamp int64 // timestamp of when the G last started being tracked
runnableTime int64 // the amount of time spent runnable, cleared when running, only used when tracking
lockedm muintptr
sig uint32
writebuf []byte
sigcode0 uintptr
sigcode1 uintptr
sigpc uintptr
parentGoid uint64 // goid of goroutine that created this goroutine
gopc uintptr // pc of go statement that created this goroutine
ancestors *[]ancestorInfo // ancestor information goroutine(s) that created this goroutine (only used if debug.tracebackancestors)
startpc uintptr // pc of goroutine function
racectx uintptr
waiting *sudog // sudog structures this g is waiting on (that have a valid elem ptr); in lock order
cgoCtxt []uintptr // cgo traceback context
labels unsafe.Pointer // profiler labels
timer *timer // cached timer for time.Sleep
selectDone atomic.Uint32 // are we participating in a select and did someone win the race?
coroarg *coro // argument during coroutine transfers
// goroutineProfiled indicates the status of this goroutine's stack for the
// current in-progress goroutine profile
goroutineProfiled goroutineProfileStateHolder
// Per-G tracer state.
trace gTraceState
// Per-G GC state
// gcAssistBytes is this G's GC assist credit in terms of
// bytes allocated. If this is positive, then the G has credit
// to allocate gcAssistBytes bytes without assisting. If this
// is negative, then the G must correct this by performing
// scan work. We track this in bytes to make it fast to update
// and check for debt in the malloc hot path. The assist ratio
// determines how this corresponds to scan work debt.
gcAssistBytes int64
}
Through compile-time injection we add one field to it, trace_tls, which stores the trace context:
trace_tls SpanContext
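When code calls tracer.Start directly, the instrumentation looks up the trace context in g. When a new goroutine is created, the instrumented newproc1 function passes the parent goroutine's trace information to the child goroutine. This keeps all context that belongs to a single trace ID linked together.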
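The inheritance rule can be pictured with a toy model. The sketch below does not touch the real runtime: g, newproc1, and startSpan here are simplified stand-ins written for this example, and the SpanContext fields are assumptions. It only shows how a child that copies the parent's trace_tls ends up in the same trace.

```go
package main

import "fmt"

// SpanContext is a simplified stand-in for the per-goroutine trace context.
type SpanContext struct {
	TraceID string
	SpanID  string
}

// g is a toy model of the goroutine descriptor, reduced to the fields that
// matter here. It is not the real runtime.g.
type g struct {
	goid     uint64
	traceTLS SpanContext // models the injected trace_tls field
}

var nextGoid uint64 = 1

// newproc1 models goroutine creation: the child inherits the parent's
// trace context.
func newproc1(parent *g) *g {
	nextGoid++
	child := &g{goid: nextGoid}
	if parent != nil {
		child.traceTLS = parent.traceTLS
	}
	return child
}

// startSpan models tracer.Start: it reuses the trace ID found on the current
// g, or begins a new trace if none exists, then records the new span on g.
func startSpan(cur *g, spanID string) SpanContext {
	sc := SpanContext{TraceID: cur.traceTLS.TraceID, SpanID: spanID}
	if sc.TraceID == "" {
		sc.TraceID = fmt.Sprintf("trace-%d", cur.goid)
	}
	cur.traceTLS = sc
	return sc
}

func main() {
	root := &g{goid: 1}
	parentSpan := startSpan(root, "span-a")

	child := newproc1(root) // models `go f()` inside the instrumented code
	childSpan := startSpan(child, "span-b")

	fmt.Println(parentSpan.TraceID == childSpan.TraceID)
}
```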
Metrics
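Metrics are collected much like traces: at every instrumentation point we record data such as call count, duration, errors, and slow requests. To avoid the performance problems caused by metric cardinality explosion, we reduce the number of metric series through convergence, using the following two convergers:
- Regular converger: transforms the input directly according to rules, for example normalizing URLs or SQL statements.
- Limit converger: caps the size of the value domain after convergence. Internally it maintains, in different ways, a whitelist with an upper size bound, Limit. In general, the convergence logic is triggered when the whitelist is full and the value to converge is not in the whitelist.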
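The two convergers described above can be sketched as follows. This is an illustrative approximation only: the URL rule (numeric path segments become "{id}"), the overflow placeholder, and all type and function names are assumptions, not the Agent's actual implementation.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// isDigits reports whether s consists only of ASCII digits.
func isDigits(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		if r < '0' || r > '9' {
			return false
		}
	}
	return true
}

// convergeURL is a regular converger: it rewrites its input by a fixed rule,
// here replacing numeric path segments with a placeholder.
func convergeURL(path string) string {
	segs := strings.Split(path, "/")
	for i, s := range segs {
		if isDigits(s) {
			segs[i] = "{id}"
		}
	}
	return strings.Join(segs, "/")
}

// LimitConverger caps the number of distinct values it lets through.
type LimitConverger struct {
	mu       sync.Mutex
	limit    int
	allowed  map[string]struct{} // the bounded whitelist
	overflow string              // value reported once the whitelist is full
}

func NewLimitConverger(limit int, overflow string) *LimitConverger {
	return &LimitConverger{limit: limit, allowed: make(map[string]struct{}), overflow: overflow}
}

// Converge returns v while it is, or can still become, whitelisted;
// otherwise it returns the overflow placeholder.
func (c *LimitConverger) Converge(v string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.allowed[v]; ok {
		return v
	}
	if len(c.allowed) < c.limit {
		c.allowed[v] = struct{}{}
		return v
	}
	return c.overflow
}

func main() {
	lc := NewLimitConverger(2, "{others}")
	for _, p := range []string{"/users/42", "/orders/7/items", "/health", "/users/99"} {
		fmt.Println(lc.Converge(convergeURL(p)))
	}
}
```

Chaining the two keeps the label set small: the regular converger removes unbounded parts of a value first, and the limit converger then guarantees an upper bound on whatever remains.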
Go Agent Plugin
Go Agent supports 20 common microservice frameworks and middleware SDKs, and is compatible with the OTel SDK.
No.  Plugin                        Lowest version  Highest version
1    net/http                      v1.18           v1.21
2    go-restful                    v3.7.0          v3.12.0
3    fasthttp                      v1.50.0         v1.54.0
4    go-zero                       v1.5.0          v1.6.5
5    echo                          v4.11.4         v4.12.0
6    gin                           v1.8.0          v1.9.0
7    mux                           v1.8.1
8    dubbo                         v3.0.1          v3.1.0
9    kratos                        v2.5.2          v2.7.3
10   go-micro                      v4.9.0          v4.11.0
11   grpc                          v1.55.0         v1.64.0
12   go-redis                      v9.0.3          v9.0.5
13   rocketmq-client-go            v2.1.0          v2.1.2
14   amqp                          v1.9.0          v1.10.0
15   Go standard library (mysql)   v1.18           v1.21
16   go-sql-driver                 v1.4.0          v1.7.1
17   mongo                         v1.11.1         v1.11.7
18   gorm                          v1.22.0         v1.25.1
19   otel sdk                      v1.6.0          v1.26.0
20   kitex                         v0.9.0          v0.10.0
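Go Agent Compatibility
OTel SDK Compatibility
OTel SDK versions v1.6.0 through v1.26.0 are supported. The example below shows code that already uses the OTel SDK for instrumentation and creates a custom span with tracer.Start: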
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

func main() {
	for {
		tracer := otel.GetTracerProvider().Tracer("")
		// User-defined span created directly through the OTel SDK.
		ctx, span := tracer.Start(context.Background(), "Client/User defined span")
		for i := 0; i < 3; i++ {
			// The request carries ctx so the outgoing call joins the custom span.
			req, err := http.NewRequestWithContext(ctx, "GET", "http://otel-server:9000/http-service1", nil)
			if err != nil {
				fmt.Println(err.Error())
				continue
			}
			client := &http.Client{}
			resp, err := client.Do(req)
			if err != nil {
				fmt.Println(err.Error())
				continue
			}
			b, err := io.ReadAll(resp.Body)
			// Close the body right away: a defer inside this endless loop would never run.
			resp.Body.Close()
			if err != nil {
				fmt.Println(err.Error())
				continue
			}
			fmt.Println(string(b))
			fmt.Println(resp.Header)
			time.Sleep(time.Millisecond * 10)
		}
		// Print the IDs and attach custom attributes before ending the span.
		sc := span.SpanContext()
		fmt.Println(sc.TraceID())
		fmt.Println(sc.SpanID())
		span.SetAttributes(attribute.String("client", "client-with-ot"))
		span.SetAttributes(attribute.Bool("user.defined", true))
		span.End()
		time.Sleep(time.Millisecond * 10)
	}
}
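When Go Agent injects code it also hooks the OTel SDK, so the trace data reported by the Agent includes the user-defined spans as well.
Trace Propagation Protocol Compatibility
Propagation over the W3C, Jaeger, EagleEye, and Zipkin protocols is supported.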
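For reference, the W3C Trace Context format carries the trace in a traceparent header of the form version-traceid-parentid-flags, all in lowercase hex. The sketch below parses version 00 of that header; the type and function names are chosen for this example and it is not the Agent's propagator.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// TraceParent holds the four fields of a version-00 traceparent header.
type TraceParent struct {
	Version, TraceID, ParentID, Flags string
}

// Sampled reports whether the sampled bit (0x01) is set in the flags.
func (tp TraceParent) Sampled() bool {
	b, err := hex.DecodeString(tp.Flags)
	return err == nil && len(b) == 1 && b[0]&0x01 == 0x01
}

// parseTraceParent validates and splits a version-00 traceparent value.
func parseTraceParent(h string) (TraceParent, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 {
		return TraceParent{}, fmt.Errorf("traceparent: want 4 fields, got %d", len(parts))
	}
	wantLen := [4]int{2, 32, 16, 2}
	for i, p := range parts {
		if len(p) != wantLen[i] {
			return TraceParent{}, fmt.Errorf("traceparent: field %d has length %d, want %d", i, len(p), wantLen[i])
		}
		if _, err := hex.DecodeString(p); err != nil || p != strings.ToLower(p) {
			return TraceParent{}, fmt.Errorf("traceparent: field %d is not lowercase hex", i)
		}
	}
	if parts[0] != "00" {
		return TraceParent{}, fmt.Errorf("traceparent: unsupported version %q", parts[0])
	}
	// All-zero trace and parent IDs are invalid.
	if parts[1] == strings.Repeat("0", 32) || parts[2] == strings.Repeat("0", 16) {
		return TraceParent{}, fmt.Errorf("traceparent: all-zero trace-id or parent-id")
	}
	return TraceParent{Version: parts[0], TraceID: parts[1], ParentID: parts[2], Flags: parts[3]}, nil
}

func main() {
	tp, err := parseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		fmt.Println("invalid header:", err)
		return
	}
	fmt.Println(tp.TraceID, tp.ParentID, tp.Sampled())
}
```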