structured & contextual logging in Go


Preface

This is a talk I prepared for my team at work.

background

Traditionally, logs were generated as free-form text messages prepended with some basic metadata, such as the time, severity, and source of the event. The advantages are:

  1. can be easily parsed
  2. can be easily stored in text files for humans to read

On the other hand...

  1. rigid metadata
  2. hard to filter, search, or trace context
  3. heavy reliance on parsing, usually with human-unreadable regexes

So what do we need?

  • structured: logs in a structured format, e.g. JSON
  • contextual: logs automatically carry additional user-specified fields (context)
  • leveled: e.g. INFO, ERROR, DEBUG, FATAL
  • verbosity: e.g. V(1), V(2)

Advantages?

  • freedom to customize metadata/specs
  • easy to track, search, and filter
  • less friction for developers

Any drawbacks?

  • a trade-off between out-of-the-box convenience and context
  • JSON-formatted logs are not that easy to read directly

related work

Lots of tools...

status quo

Not that easy to migrate to structured logging, because:

  • existing projects simply use klog for logging
  • lack of resources (time, developers, QA)
  • logging is only an auxiliary concern
  • bugs that need painstaking tracing are rare
  • learning cost

proposal

xial-thu/klog

  • almost compatible with klog
  • structured, contextual, leveled logging with verbosity support
  • better performance, achieved by wrapping zap

usage

import

Import the package as github.com/xial-thu/klog.
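A minimal import block, assuming the package name is klog:

import (
    "flag"

    "github.com/xial-thu/klog"
)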

initialization

For example, in main.go:

// Unchanged: register and parse flags as before
klog.InitFlags(nil)
flag.StringVar(&kubeconfig, "kubeconfig", "", "path to Kubernetes config file")
flag.Parse()

// New API: must be called after flag.Parse()
klog.Singleton()

If Singleton() is not called, the default global no-op logger is used, which means you will not see any real log output.

Due to some gaps between klog and zap, flag values have to be converted, and the conversion must happen after flag.Parse(). klog.Singleton() initializes a unique global logger whose configuration differs slightly from the default zap production configuration in two ways (sketched below):

  1. TimeKey is set to "time"
  2. EncodeTime is set to ISO8601TimeEncoder
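In zap terms, the customization amounts to roughly the following (a sketch inferred from the two points above, not the package's actual source):

// uses go.uber.org/zap and go.uber.org/zap/zapcore
cfg := zap.NewProductionConfig()
cfg.EncoderConfig.TimeKey = "time"
cfg.EncoderConfig.EncodeTime = zapcore.ISO8601TimeEncoder
logger, _ := cfg.Build()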

flags

Not all flags defined in klog are supported; or rather, not all of them still make sense. Only alsologtostderr and v are currently supported.

  • v: the klog.V(2).Info() syntax is still supported, but klog.Info() maps to the INFO level, while klog.V(3).Info() maps to DEBUG. If v is set to zero, zap DEBUG logs are dropped. Defaults to 0 (see the example after this list)
  • alsologtostderr: defaults to true. If set to false, INFO and DEBUG logs are written only to stdout
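For example, assuming the binary is started with --v=3 and klog's usual gating semantics:

klog.Info("starting controller")   // INFO: always emitted
klog.V(2).Info("sync details")     // DEBUG: emitted, since 2 <= 3
klog.V(5).Info("very noisy trace") // suppressed, since 5 > 3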

structured logging

There are 3 APIs:

  • With(): parses each field and value from its input. With(struct{A string}{"hi"}) will output "A":"hi". If you care about the individual fields of your struct and want them extracted, use With()
  • WithAll(): sugar over zap.Any(), e.g. WithAll(struct{A string}{"hi"}) will output "":{"A":"hi"}. If you want to record the struct as a whole under its name, use WithAll()
  • WithFields(): e.g. WithFields("ID", 1, "name", "hi"); just another layer of sugar over sugar.With() (see the snippet after this list)
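Usage of the two sugar variants might look like this (the expected outputs are copied from the descriptions above):

klog.WithFields("ID", 1, "name", "hi").Info("msg") // "ID":1,"name":"hi"
klog.WithAll(struct{ A string }{"hi"}).Info("msg") // "":{"A":"hi"}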

Tips for With():

  1. Only structs and maps are accepted
  2. If the arg is a struct, only exported fields (i.e. "FieldName", not "fieldname") are logged
  3. If the arg is a map, only maps with string keys are accepted
  4. If you have nested structs or maps, consider splitting them into several anonymous parts
  5. If you don't want certain fields logged automatically, unexport them

Some examples of With():

type ID int

type S struct {
	A int
	B string
}
type Q struct {
	D ID
}
s := S{
	A: 10,
	B: "abc",
}
q := Q{
	D: ID(1),
}
c := ""

// struct args
With(s).Info(c)    // "A":10,"B":"abc"
With(s, q).Info(c) // "A":10,"B":"abc","D":1
// anonymous struct
With(struct {
	A int
	B int
}{1, 2}).Info(c) // "A":1,"B":2
// map args
With(map[string]int{"A": 1}).Info(c) // "A":1

performance of With()

| case | ns/op | B/op | allocs/op |
| --- | --- | --- | --- |
| With every time | 3121 | 2884 | 20 |
| WithFields every time | 2239 | 1568 | 9 |
| WithAll | 3215 | 2642 | 14 |
| With once | 573 | 7 | 1 |

Because it relies on reflection, With() is slow. The numbers suggest binding fields once and reusing the returned logger, as shown below, instead of parsing interfaces on every call.

// bind the contextual fields once...
newLogger := klog.With(something)
// ...and reuse the derived logger afterwards
newLogger.Info(something)
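For reference, benchmarks along these lines could produce the table above (a sketch reusing the S type from the earlier example; not the repository's actual benchmark code):

package klog_test

import (
    "testing"

    "github.com/xial-thu/klog"
)

type S struct {
    A int
    B string
}

func BenchmarkWithEveryTime(b *testing.B) {
    s := S{A: 10, B: "abc"}
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        klog.With(s).Info("") // reflection on every call
    }
}

func BenchmarkWithOnce(b *testing.B) {
    logger := klog.With(S{A: 10, B: "abc"}) // reflection once
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        logger.Info("")
    }
}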

limitation

  1. the default field set is empty
  2. structured logging still requires changes in business logic
  3. flag support needs more work

Best Practice

If you don't need additional metadata

just replace k8s.io/klog with github.com/xial-thu/klog and add one single API call: klog.Singleton()
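A minimal migrated main.go might look like this (a sketch; the log message is illustrative):

package main

import (
    "flag"

    "github.com/xial-thu/klog" // was: k8s.io/klog
)

func main() {
    klog.InitFlags(nil)
    flag.Parse()
    klog.Singleton() // the one extra call
    klog.Info("structured logging enabled")
}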

Why isn't it fully compatible?

  1. The flags are not used directly, so some conversion has to happen after flag.Parse()
  2. Although the default logger could be set to a production-ready one, users may still want minor changes, such as the time format or output files, so an explicit init function is necessary.

Add some flavor

Ideally, context covers the full lifecycle. Add the contexts at the framework level, not in business logic. For example, if you have

func main() {
    handler := someFunc
    framework := NewFramework(handler)
    framework.Run()
}

add some flavor in NewFramework(); the goal is to pass the logger into the handler.

type context struct{}

func NewFramework(handler someFunc) *framework {
    ctx := context{}
    f := &framework{
        logger: klog.With(ctx),
    }
    f.AddHandler(handler)
    return f
}

So here are some tips:

  1. carefully design your logging metadata (context)
  2. carefully design your software architecture and try to avoid reverse dependencies

Assume we have

type request struct {
    httpContext HTTPContext
    encodedRuntimeObject []byte
}

Obviously, we'd like to track the runtime object by adding fields to the logger. But the problem is: at the framework level, we DO NOT know the object yet. So we pass the logger into the business logic and wait for the decoder to extract the runtime information. To make things worse, any logs printed inside the framework itself can never carry that runtime object.
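One way the hand-off could look (a hedged sketch: serve, decode, handle, and the object's Kind/Name fields are all hypothetical names, not the framework's real API):

func (f *framework) serve(req *request) {
    // Framework-level logs cannot carry runtime-object fields yet.
    f.logger.Info("received request")

    obj, err := decode(req.encodedRuntimeObject)
    if err != nil {
        // If we fail here, the runtime object is lost to the logs forever.
        f.logger.Error("failed to decode runtime object")
        return
    }

    // Once decoded, derive an enriched logger and pass it into the
    // business logic (simplified: a fresh klog.With with the new fields).
    logger := klog.With(struct{ Kind, Name string }{obj.Kind, obj.Name})
    handle(logger, req, obj)
}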

Some examples

kubectl-gpu

{"level":"info","time":"2020-04-20T10:19:51.169+0800","caller":"server/handler.go:140","msg":"receive delete event of weike/parrots-20200416--slurm-node-0"}

{"level":"warn","time":"2020-04-20T10:19:51.169+0800","caller":"server/handler.go:155","msg":"unrecommended FORCE-DELETE: resource has not been released before DELETE event"}

admission-webhook

{"level":"info","time":"2020-04-20T10:27:36.075+0800","caller":"validate/pod.go:48","msg":"failed validation: path /local-path-storage failed to pass any rules","Method":"POST","Host":"validate-webhook.phx-system.svc:443","Path":"/validate","Proto":"HTTP/2.0","UID":"0309715d-c910-4489-9b1c-03a7d8fbec0f","UserName":"system:serviceaccount:monitoring:local-path-provisioner","UserGroup":["system:serviceaccounts","system:serviceaccounts:monitoring","system:authenticated"],"Group":"","Version":"v1","Kind":"Pod","Resource":"pods","SubResource":"","Namespace":"monitoring","Operation":"CREATE","Name":"delete-pvc-7c17608d-d210-45fd-8d01-ca9b2945c5e1"}

How does it benefit us? Explore Kibana!

filtering, searching and debugging

structured logging: admission request and response

regex parsing: tracing info in APIServer logs. You DO NOT want to read the regex. /^(?[IWE]+)(?[^ ]* [^ ]*) (?[^ ]*) (?[^\] ]*)\] (?:(?[A-Z]{2,}))?(?:(? [/]{1}[/A-Za-z0-9.-]{2,}))?(?:(?[\?]{1}[^\: ]*))?(?:\: \((?[^ \)]*))?(?:\) (?[0-9]{1,}))?*(?.*)$/

common logs: all the info is crammed into f_message, so filtering and searching fall back to Lucene query grammar. Why bother?

full lifecycle tracking

use filters

Statistics & aggregation

  • visualize
  • dashboard

aggregation is the silver bullet.

The aggregations in this family compute metrics based on values extracted in one way or another from the documents that are being aggregated. The values are typically extracted from the fields of the document (using the field data), but can also be generated using scripts.

Kibana-supported aggregations:

  • extracting from document fields: requires the field type to be 'number'. Sadly, '1.23ms' is NOT a numerical field!
  • using scripts: do we really need that?

Proceed with caution

Please familiarize yourself with script fields and with scripts in aggregations before using scripted fields.

Scripted fields can be used to display and aggregate calculated values. As such, they can be very slow, and if done incorrectly, can cause Kibana to be unusable. There's no safety net here. If you make a typo, unexpected exceptions will be thrown all over the place!

So why not use metrics instead?