The Long Road of Pitfalls with clusterloader2: A Detailed Walkthrough and Usage Guide


The previous article, "Kubernetes performance metrics: SIG-scalability, SLO, clusterloader2", gave a fairly high-level introduction to the k8s SLIs/SLOs and the community's testing tools. This article analyzes clusterloader2 at the source-code level and records my long road of pitfalls. To start, note that the cl2 I use is the release-1.15 branch rather than the latest master branch, because my k8s version is 1.15.5.

What does the config file actually mean?

After cloning the perf-tests repo and entering the clusterloader2 directory, the first sentence that greets you is

To run ClusterLoader type:

go run cmd/clusterloader.go --kubeconfig=kubeConfig.yaml --testconfig=config.yaml

Once you open config.yaml, I believe most people will be at a loss. Moreover, if you built your kubernetes cluster yourself, there is a good chance cl2 will not run successfully right away. So before diving into the pitfalls, we first need to understand: what exactly is this config file?

This config file describes which test stages cl2 will run, which actions are performed in each stage, and which data is collected. To save space I will not repeat what is in the design doc; after reading it you should be able to understand the overall structure quite smoothly. Here I walk through the flow of the density test's config file; when testing scheduler performance you can use this file as a template.

The parameters in density-config-local.yaml are complicated. You must learn their grammar first so that you are able to adjust them.

Here is the grammar you must know (a small worked sketch follows the list):

  1. {{$DENSITY_RESOURCE_CONSTRAINTS_FILE := DefaultParam .DENSITY_RESOURCE_CONSTRAINTS_FILE ""}} means the parameter DENSITY_RESOURCE_CONSTRAINTS_FILE defaults to "" if it is not set. You can set it manually to override its default value
  2. {{$MIN_LATENCY_PODS := 300}} just means setting the parameter to 300
  3. {{$namespaces := DivideInt .Nodes $NODES_PER_NAMESPACE}} means the number of namespaces is equal to floor(nodes / nodes_per_namespace). NOTE that .Nodes MUST NOT be less than .NODES_PER_NAMESPACE
  4. {{$podsPerNamespace := MultiplyInt $PODS_PER_NODE $NODES_PER_NAMESPACE}} is similar to rule 3, but it multiplies the parameters
  5. {{$saturationDeploymentHardTimeout := MaxInt $saturationDeploymentTimeout 1200}} means max(saturationDeploymentTimeout, 1200)
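
Here is a tiny sketch of what these helpers compute, using toy re-implementations and arbitrary example values (DefaultParam has no equivalent below; it simply falls back to the given default when the variable is unset):

package main

import "fmt"

// Toy re-implementations of the template helpers, purely to illustrate the arithmetic.
func divideInt(a, b int) int   { return a / b } // floor division for positive ints
func multiplyInt(a, b int) int { return a * b }
func maxInt(a, b int) int {
	if a > b {
		return a
	}
	return b
}

func main() {
	fmt.Println(divideInt(7, 2))    // 3  -- DivideInt floors the result
	fmt.Println(multiplyInt(30, 4)) // 120
	fmt.Println(maxInt(600, 1200))  // 1200
}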

Then you must be familiar with the procedure so that you know what each parameter means. There's no silver bullet.

Below the parameters comes the test procedure itself. I'll explain it step by step.

name: density
automanagedNamespaces: {{$namespaces}}
tuningSets:
- name: Uniform5qps
  qpsLoad:
    qps: 5
{{if $ENABLE_CHAOSMONKEY}}
chaosMonkey:
  nodeFailure:
    failureRate: 0.01
    interval: 1m
    jitterFactor: 10.0
    simulatedDowntime: 10m
{{end}}

Don't worry about name, automanagedNamespaces, and tuningSets; in most cases you don't need to care about them. chaosMonkey refers to Chaos Monkey, an open-source tool developed by Netflix that tests the robustness of your system by shutting down nodes at random and introducing jitter. It is disabled by default.

steps:
- name: Starting measurements
  measurements:
  - Identifier: APIResponsivenessPrometheus
    Method: APIResponsivenessPrometheus
    Params:
      action: start
  - Identifier: APIResponsivenessPrometheusSimple
    Method: APIResponsivenessPrometheus
    Params:
      action: start

steps lists the procedures you define. Each step may contain phases and measurements. A measurement defines what you want to monitor or capture; a phase describes the attributes of a particular batch of tasks. This config defines the following steps:

  1. Starting measurements: just preparation; you don't need to care what happens here.
  2. Starting saturation pod measurements: same as above
  3. Creating saturation pods: the first case is saturation pods
  4. Collecting saturation pod measurements
  5. Starting latency pod measurements
  6. Creating latency pods: the second case is latency pods
  7. Waiting for latency pods to be running
  8. Deleting latency pods
  9. Waiting for latency pods to be deleted
  10. Collecting pod startup latency
  11. Deleting saturation pods
  12. Waiting for saturation pods to be deleted
  13. Collecting measurements

So we can see the testing mainly gathers measurements during the CRUD of saturation pods and latency pods:

  • saturation pods: pods in deployments with a rather large number of replicas
  • latency pods: pods in deployments with one replica each

Now you can see the difference between the two modes. When the saturation pods are created, the replica controllers in kube-controller-manager handle a single event, whereas with the latency pods it is hundreds of events. Why does that matter? Because the various rate limiters inside kubernetes affect the performance of the scheduler and controller-manager differently under the two load patterns.

In each case, what we care about is the number of pods, deployments, and namespaces. We all know that kubernetes limits pods per node and pods per namespace, so it is essential to adjust the relevant parameters to achieve a reasonable load.

latency pods

Follow my math (a worked example covering both calculations appears after the saturation pods list below):

  • latency pods = namespaces * latencyReplicas
  • namespaces = nodes / nodes per namespace
  • nodes = available kubernetes nodes your cluster has
  • nodes per namespace is $NODES_PER_NAMESPACE in line 8
  • latencyReplicas = max(MIN LATENCY PODS, nodes) / namespaces
  • MIN LATENCY PODS is $MIN_LATENCY_PODS in line 18

saturation pods

Follow me:

  • saturation pods = namespaces * pods per namespace; this formula can be found in the Creating saturation pods step
  • pods per namespace = pods per node * nodes per namespace
  • pods per node is $PODS_PER_NODE in line 9
  • see the calculation of namespaces and nodes per namespace above in the part of latency pods
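
To make both formulas concrete, here is a small worked sketch with assumed example values (a tiny 4-node local cluster; the numbers are illustrative, not the config defaults):

package main

import "fmt"

func main() {
	// Assumed example values for a small local cluster; adjust to your own setup.
	nodes := 4             // available kubernetes nodes
	nodesPerNamespace := 4 // $NODES_PER_NAMESPACE
	podsPerNode := 30      // $PODS_PER_NODE
	minLatencyPods := 300  // $MIN_LATENCY_PODS

	namespaces := nodes / nodesPerNamespace // 1

	// latency pods
	latencyReplicas := maxInt(minLatencyPods, nodes) / namespaces // 300
	latencyPods := namespaces * latencyReplicas                   // 300

	// saturation pods
	podsPerNamespace := podsPerNode * nodesPerNamespace // 120
	saturationPods := namespaces * podsPerNamespace     // 120

	fmt.Println(latencyPods, saturationPods) // 300 120
}

func maxInt(a, b int) int {
	if a > b {
		return a
	}
	return b
}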

It's quite complicated, and you have to be patient to figure out what is actually going on. Here are some tips and rules of thumb:

  1. When testing on a local cluster, since the scale is small, we can set nodes per namespace = nodes so that there is only one namespace. That simplifies the math.
  2. When testing on kubemark, we can simulate hundreds of nodes, so it's better to have 2 or more namespaces
  3. The pod startup latency measurement only applies to latency pods, not saturation pods, although it reports the metric during the saturation pods phase as well

Now you can set the parameters and run the test. After a while (usually 5-10 minutes on a local cluster), you can check out the results.

Why doesn't this thing work out of the box?

My kubernetes cluster was installed by hand, without GCE or tools like kubeadm, so I ran into a great many pitfalls and had to hack cl2 in many places before I could barely get the tests to run to completion.

Below are the pitfalls I hit, together with a source-level analysis.

SSH issue

When collecting certain data, cl2 needs to ssh to the master node. See, for example, pkg/measurement/common/simple/scheduler_latency.go

cmd := "curl -X " + opUpper + " http://localhost:10251/metrics"
sshResult, err := measurementutil.SSH(cmd, host+":22", provider)

If the environment running cl2 cannot ssh to the master node, the program does not abort; it just prints some error logs. So we must make sure the test environment can ssh to the cluster nodes without a password.

Once passwordless ssh is sorted out, which username does it ssh with? It is the username of the account you are currently running as, and it cannot be overridden by any flag. When I tested on the company cluster, my personal account had no ssh permission, so I had to run as root. If you set up the k8s cluster on your own machine, you may be able to avoid this problem.

dependency installation issues

This mainly refers to the probes and the prometheus components. During a cl2 test run you can choose whether or not to install the prometheus stack. cmd/clusterloader.go contains the following code

func initFlags() {
	flags.StringVar(&clusterLoaderConfig.ReportDir, "report-dir", "", "Path to the directory where the reports should be saved. Default is empty, which cause reports being written to standard output.")
	flags.BoolEnvVar(&clusterLoaderConfig.EnablePrometheusServer, "enable-prometheus-server", "ENABLE_PROMETHEUS_SERVER", false, "Whether to set-up the prometheus server in the cluster.")
	flags.BoolEnvVar(&clusterLoaderConfig.TearDownPrometheusServer, "tear-down-prometheus-server", "TEAR_DOWN_PROMETHEUS_SERVER", true, "Whether to tear-down the prometheus server after tests (if set-up).")
	flags.StringArrayVar(&testConfigPaths, "testconfig", []string{}, "Paths to the test config files")
	flags.StringArrayVar(&clusterLoaderConfig.TestOverridesPath, "testoverrides", []string{}, "Paths to the config overrides file. The latter overrides take precedence over changes in former files.")
	initClusterFlags()
}

By default the prometheus stack and probes are not installed; users can manage them on their own. Note:

  1. If enable-prometheus-server is false, the tear-down-prometheus-server flag has no effect
  2. If you install prometheus yourself, make sure it matches cl2's expectations. The prometheus yaml files are stored in pkg/prometheus/manifests; they use the prometheus operator, and key details such as ports and namespaces are left unchanged. So if you also install via the prometheus operator, you can most likely run cl2 without getting stuck on prometheus
  3. If you let cl2 install prometheus, here comes the pitfall! cl2 reads $GOPATH to locate the prometheus manifests. In other words, it assumes you are not using go modules and that $GOPATH is non-empty. In practice we very likely are using go modules, and on many systems echo $GOPATH prints nothing! So I still recommend installing prometheus and the probes yourself. (If you do want cl2 to manage prometheus, see the example invocation after this list.)
  4. prometheus and the probes must be installed, otherwise the cl2 test cannot run
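
For reference, an invocation that asks cl2 to manage prometheus itself looks roughly like this (kubeConfig.yaml, config.yaml, and the report directory are placeholders for your own files, just as in the README command above):

go run cmd/clusterloader.go --kubeconfig=kubeConfig.yaml --testconfig=config.yaml --enable-prometheus-server=true --tear-down-prometheus-server=false --report-dir=./reports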

metrics grabber issue

First, my environment: kube-scheduler, kube-controller-manager, kube-apiserver, kube-proxy, and etcd are all deployed as binaries, and plain http access to etcd and the kubelet is disabled. This caused constant errors at runtime. See pkg/measurement/common/simple/etcd_metrics.go

	// In https://github.com/kubernetes/kubernetes/pull/74690, mTLS is enabled for etcd server
	// http://localhost:2382 is specified to bypass TLS credential requirement when checking
	// etcd /metrics and /health.
	if samples, err := e.sshEtcdMetrics("curl http://localhost:2382/metrics", host, provider); err == nil {
		return samples, nil
	}

	// Use old endpoint if new one fails.
	return e.sshEtcdMetrics("curl http://localhost:2379/metrics", host, provider)
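
Since my etcd only serves https with client certificates, I ended up pointing the curl at the https endpoint with certs. A rough sketch of the change (the certificate paths are assumptions from my own deployment, not anything cl2 ships):

	// Sketch: fetch etcd metrics over mTLS. The cert paths below are
	// assumptions about where my deployment keeps the etcd client certs.
	cmd := "curl --cacert /etc/etcd/ssl/ca.pem" +
		" --cert /etc/etcd/ssl/etcd.pem" +
		" --key /etc/etcd/ssl/etcd-key.pem" +
		" https://localhost:2379/metrics"
	return e.sshEtcdMetrics(cmd, host, provider)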

Correspondingly, you must check how the metrics of the other components in that directory are fetched and make sure it matches your environment.

Beyond that, there is another deep pitfall. pkg/measurement/common/simple/metrics_for_e2e.go creates a grabber to scrape the metrics of each component

grabber, err := metrics.NewMetricsGrabber(
		config.ClusterFramework.GetClientSets().GetClient(),
		nil, /*external client*/
		grabMetricsFromKubelets,
		true,  /*grab metrics from scheduler*/
		true,  /*grab metrics from controller manager*/
		true,  /*grab metrics from apiserver*/
		false /*grab metrics from cluster autoscaler*/)

This grabber uses the package at vendor/k8s.io/kubernetes/test/e2e/framework/metrics/metrics_grabber.go. Digging into that package, we find that it assumes every component is deployed as a pod when scraping it. For example, from vendor/k8s.io/kubernetes/test/e2e/framework/metrics/metrics_grabber.go

func (g *MetricsGrabber) GrabFromScheduler() (SchedulerMetrics, error) {
	if !g.registeredMaster {
		return SchedulerMetrics{}, fmt.Errorf("Master's Kubelet is not registered. Skipping Scheduler's metrics gathering.")
	}
	output, err := g.getMetricsFromPod(g.client, fmt.Sprintf("%v-%v", "kube-scheduler", g.masterName), metav1.NamespaceSystem, ports.InsecureSchedulerPort)
	if err != nil {
		return SchedulerMetrics{}, err
	}
	return parseSchedulerMetrics(output)
}

If you deploy the k8s components as binaries, you need to modify this vendored package. Worst of all, if you use the master branch, it has already switched to go modules, so you have to create your own package and rewrite it yourself...
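
If all you need is, say, the scheduler metrics from a binary-deployed component, a simpler workaround is to fetch the metrics endpoint directly instead of going through the pod-based grabber. This is only a sketch of what I mean, not cl2 code; the master address and the insecure port 10251 must match your deployment:

package metricsutil

import (
	"fmt"
	"io/ioutil"
	"net/http"
)

// fetchSchedulerMetrics pulls the raw Prometheus text straight from a
// binary-deployed kube-scheduler. masterAddr and port 10251 are assumptions
// about your environment, not cl2 behaviour.
func fetchSchedulerMetrics(masterAddr string) (string, error) {
	resp, err := http.Get(fmt.Sprintf("http://%s:10251/metrics", masterAddr))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := ioutil.ReadAll(resp.Body)
	return string(body), err
}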

master node issue

cl2 automatically determines which node is the master node; the logic lives in vendor/k8s.io/kubernetes/pkg/util/system/system_utils.go

// TODO: find a better way of figuring out if given node is a registered master.
// IsMasterNode checks if it's a master node, see http://gitlab.bj.sensetime.com/xialei1/perf-tests/issues/4
func IsMasterNode(node corev1.Node) bool {
	// We are trying to capture "master(-...)?$" regexp.
	// However, using regexp.MatchString() results even in more than 35%
	// of all space allocations in ControllerManager spent in this function.
	// That's why we are trying to be a bit smarter.
	name := node.Name
	if strings.HasSuffix(name, "master") {
		return true
	}
	return false
}

It decides whether a node is the master by checking whether its name ends with the suffix "master". Well... kubeadm labels the master node with node-role.kubernetes.io/master='' after installation, and other k8s installation methods don't necessarily name the master node this way. I have raised this with the community; see perf-tests #1191
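
A label-based check would be more robust. Here is a sketch (it assumes your master carries the kubeadm-style node-role.kubernetes.io/master label mentioned above):

package nodeutil

import corev1 "k8s.io/api/core/v1"

// isMasterNodeByLabel is a sketch of a label-based alternative: it assumes
// the master node carries the kubeadm-style role label.
func isMasterNodeByLabel(node corev1.Node) bool {
	_, ok := node.Labels["node-role.kubernetes.io/master"]
	return ok
}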

scheduler throughput issue

I reported this issue to the community earlier, and it has now been fixed on the master branch; see perf-tests #1083. In short, cl2 mistakenly used the average throughput as the metric, whereas the figure they had actually always relied on was the maximum throughput.

The relevant code is in pkg/measurement/common/simple/scheduler_throughput.go

type schedulingThroughput struct {
	Average float64 `json:"average"`
	Perc50  float64 `json:"perc50"`
	Perc90  float64 `json:"perc90"`
	Perc99  float64 `json:"perc99"`
}

There should in fact also be a max value.
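
A sketch of what that could look like, just to illustrate the point (this is my own illustration, not the actual upstream fix): add a Max field and pick the peak from the per-interval throughput samples.

// Illustration only: the struct extended with a Max field, plus a helper
// that picks the peak out of the per-interval throughput samples.
type schedulingThroughput struct {
	Average float64 `json:"average"`
	Max     float64 `json:"max"`
	Perc50  float64 `json:"perc50"`
	Perc90  float64 `json:"perc90"`
	Perc99  float64 `json:"perc99"`
}

func maxThroughput(samples []float64) float64 {
	max := 0.0
	for _, s := range samples {
		if s > max {
			max = s
		}
	}
	return max
}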

What exactly are these metrics?

After stepping through the pitfalls above and countless hacks and debugging sessions, I finally got cl2 running. Once a run finished, a new question surfaced: what does each of these metrics actually mean? How do I locate a bottleneck from them?

cl2's metrics are mostly defined from the user's end-to-end perspective; each metric spans many flows and components, which makes it hard to pinpoint a concrete bottleneck. So we need to work out, for each metric, exactly which flow it covers: when it starts and when it ends.

Currently there are three official test metrics

  1. mutating api
  2. readonly api
  3. latency pod startup

pod startup latency

This measurement is relatively involved. The source is in pkg/measurement/common/slos/pod_startup_latency.go and is split into a start phase and a gather phase. The results are recorded in podStartupEntries, a map[string]map[string]time structure that records the timestamp of each phase for each pod.

In the start phase, an informer is started to watch pods. Whenever a pod is observed in the running state and podStartupEntries has no record for it yet, the pod is recorded in podStartupEntries:

  1. watchPhase is set to time.Now()
  2. createPhase is set to the pod's creationTimestamp
  3. runPhase is set to the timestamp at which the pod's containers entered the running state

In the gather phase, the pod watch is stopped and all events are then iterated over. The scheduler records an event after scheduling a pod, something like

4d17h Normal Scheduled pod/sensestar-test-gvl92 Successfully assigned default/sensestar-test-gvl92 to sh-idc1-10-5-8-62

For every pod that already has an entry, schedulePhase is set to the event's timestamp. The following metrics are then aggregated (a small sketch of this bookkeeping follows the list):

  1. "create_to_schedule": event中scheduled事件 - pod creationTimeStamp
  2. "schedule_to_run": pod的容器处于running状态 - event中scheduled事件
  3. "run_to_watch": informer接收到running pod - pod的container处于running
  4. "schedule_to_watch": informer接收到running pod - event中scheduled事件
  5. "pod_startup": informer接收到running pod - pod创建时间戳

This way of measuring looks a bit odd, but it breaks the pod creation process down into per-stage latencies. Actually the kubelet already has a ready-made metric, kubelet_pod_start_duration_seconds, which you can query in prometheus with

histogram_quantile(0.99, sum(rate(kubelet_pod_start_duration_seconds_bucket[1h])) by (le))

As an aside, here is the flow of pod creation within a deployment

  1. The apiserver receives the request to create the deployment, persists it to etcd, and notifies the controller-manager
  2. The controller-manager creates the pod skeleton, stamps its creationTimestamp, and sends the creation request to the apiserver
  3. The apiserver receives the pod creation request, writes it to etcd, and pushes it to the scheduler
  4. The scheduler picks a node, fills in nodeName, and updates the pod via the apiserver. At this point the pod is pending and has not actually been created
  5. The apiserver updates the pod in etcd and pushes it to the kubelet on the chosen node
  6. The kubelet creates the pod, fills in HostIP and resourceVersion, and sends an update request to the apiserver; the pod is still pending
  7. The apiserver writes the update to etcd while the kubelet continues creating the pod. Once all containers are running, the kubelet sends another pod update to the apiserver; the pod is now running
  8. The apiserver receives the request, writes it to etcd, and pushes it to the informer, which records watchPhase

mutating api and readonly api

cl2's query:

histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{resource!="events", verb!~"WATCH|WATCHLIST|PROXY|proxy|CONNECT"}[20m])) by (resource, subresource, verb, scope, le))

It is scraped from the apiserver, and its meaning is

Response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope and component.

It spans from when the apiserver receives the request until it finishes sending the response. The read-only api only involves etcd; the mutating api may involve other components, which I won't expand on here.

etcd metrics

I add etcd as an extra item here because etcd is the focus of our k8s performance tuning. The etcd metrics in the results are actually quite clear; I just want to emphasize them...

Reflections on the clusterloader2 pitfalls

In my opinion, cl2's strengths are

  1. its modelling of the test procedure
  2. fairly comprehensive metric collection

And in my view its weaknesses are

  1. a somewhat steep learning curve, extremely unfriendly to users who install their own k8s clusters...
  2. limited introductory documentation
  3. somewhat messy branch management