kube-scheduler Source Code Analysis

Background

Introduction

The native Kubernetes scheduler is less a scheduler than a pod placer. Much like tidying up at home, it puts all kinds of things into storage boxes, and occasionally, when a box is full, it takes something (a pod) back out (eviction).

And we are a bit slow-witted: we cannot see how these things (pods) relate to each other. If you ask me to put all the figurines together and all the snacks together, but never tell me where the figurines should go (just a meaningful look?), my brain short-circuits. I can only place items one at a time; while tidying I can see what is in every box, but I have no idea what else is still waiting to be put away.

This is the confession of that clumsy yet highly efficient housekeeper.

Also, before reading the Kubernetes source code, it is recommended to study the client-go source first, especially the informer mechanism. It is a crucial part of the Kubernetes architecture, responsible for resource synchronization and event handling between each component and the apiserver. For the background, see the informer mechanism analysis.

Scheduling Flow

Following the community's scheduler design, the scheduler's execution can be roughly divided into the following phases (a toy sketch follows the list):

  1. Pick a pod
  2. Filtering (predicates)
  3. Scoring (priorities)
  4. Select the highest-scoring node
  5. Bind
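
To make the five phases concrete, here is a toy, self-contained Go sketch of one scheduling round. Everything in it (the pod/node types, filter, score) is a made-up simplification for illustration, not the real kube-scheduler API.

	package main

	import "fmt"

	// Hypothetical simplifications of a pod and a node, just to show the flow.
	type pod struct{ name string }
	type node struct {
		name    string
		freeCPU int
	}

	// Phase 2: filtering, keep only the nodes that could host the pod.
	func filter(p pod, nodes []node) []node {
		var feasible []node
		for _, n := range nodes {
			if n.freeCPU >= 1 {
				feasible = append(feasible, n)
			}
		}
		return feasible
	}

	// Phase 3: scoring, more free CPU means a higher score here.
	func score(p pod, n node) int { return n.freeCPU }

	func main() {
		p := pod{name: "demo"}                           // phase 1: a pod was picked
		nodes := []node{{"n1", 0}, {"n2", 4}, {"n3", 2}}
		feasible := filter(p, nodes)                     // phase 2
		best, bestScore := feasible[0], score(p, feasible[0])
		for _, n := range feasible[1:] {                 // phases 3 and 4
			if s := score(p, n); s > bestScore {
				best, bestScore = n, s
			}
		}
		fmt.Printf("bind %s -> %s\n", p.name, best.name) // phase 5: here we just print
	}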

Evolution

Constrained by the container isolation model of namespaces and cgroups, Kubernetes as a whole cannot achieve OpenStack-grade isolation for multi-tenancy. The scheduler is no exception: for multi-tenancy it can only rely on soft isolation mechanisms such as namespaces and quotas. Moreover, because the scheduler's scheduling granularity is a single pod, it cannot schedule a batch of tasks holistically, yet many scenarios, such as:

  • AI training jobs
  • batch jobs
  • big data jobs

often require the system to handle a batch of tasks as a whole. These shortcomings in batch scheduling and multi-tenancy gradually became pain points for practitioners in those fields. To address this, the community introduced kube-batch. In short, kube-batch is a multi-tenant batch scheduler.

kube-batch is a batch scheduler for Kubernetes, providing mechanisms for applications which would like to run batch jobs leveraging Kubernetes. It builds upon a decade and a half of experience on running batch workloads at scale using several systems, combined with best-of-breed ideas and practices from the open source community.

kube-batch, and the schedulers derived from it such as volcano, deserve a separate article when time permits.

Community solutions also push the upstream design forward. Back in 2018 the community proposed a scheduler enhancement (the scheduling framework) that absorbs ideas from batch schedulers and abstracts scheduling features as plugins.

The scheduling framework is a new set of "plugin" APIs being added to the existing Kubernetes Scheduler. Plugins are compiled into the scheduler, and these APIs allow many scheduling features to be implemented as plugins, while keeping the scheduling "core" simple and maintainable.

If this proposal makes further progress, I will keep following it.

Source Code Analysis

(I will definitely not admit that the only reason I am not reading the latest code is that I am running 1.15.)

Date     Version
2019.12  1.15

This analysis will not go into the details of common libraries such as cobra, pflag, and klog.

The source is examined at three levels:

  1. cmd/kube-scheduler/scheduler.go: the main function
  2. pkg/scheduler/scheduler.go: the event-handling framework that wraps the actual scheduling algorithm
  3. pkg/scheduler/core/generic_scheduler.go: the scheduling algorithm

scheduler.go and the overall scheduler framework

Starting from cmd/kube-scheduler/scheduler.go, after a stretch of trivial pflag and cobra code, we find the runCommand() function in cmd/kube-scheduler/app/server.go. This code is only the preparation for running the scheduler; it does the following:

  1. Validate that the options are legal
  2. Create stopCh as the signal for terminating the scheduler
  3. Call c, err := opts.Config() to generate the scheduler configuration
  4. Fill unset options with defaults (this also touches apiserver authentication/authorization)
  5. applyFeatureGates() enables or disables certain predicates and priorities by default. This part is currently quite messy and changes from one Kubernetes release to the next; interested readers can consult the kube-scheduler reference
  6. Enter the protagonist of this section, the Run() function.

A few small details: in opts.Config(), c := &schedulerappconfig.Config{} initializes the config struct. It is worth noting that InformerFactory informers.SharedInformerFactory and PodInformer coreinformers.PodInformer are separate fields. As the later code makes clear, the scheduler needs to watch the following resources:

  • pods
  • nodes
  • pv/pvc/storageClass
  • replicationController/replicaSet/statefulSet/service
  • PDB

The code then calls o.ApplyTo(c), converting the options into the config. This code is fairly trivial and mostly relies on lower-level libraries. One thing worth noting: when parsing the config file, Kubernetes uses runtime.DecodeInto(kubeschedulerscheme.Codecs.UniversalDecoder(), data, configObj). This universalDecoder is Kubernetes' own deserialization wrapper; plain json.Unmarshal() would also work here, but since the tooling is already provided, why not use it.

Next, createClients creates three clients; leaderElectionClient and eventClient are what their names suggest. The client creation differs slightly from the usual pattern: normally clientset.NewForConfig(config) is enough, but here there is an extra step, AddUserAgent(kubeConfig, "scheduler"). Checking the comment:

UserAgent is an optional field that specifies the caller of this request.

Good, so this marks the requests as coming from the scheduler. After obtaining the clientsets, a recorder is initialized to record events. The podInformer is then deliberately separated from the other informers, and the resync period is set to 0. So why single out the podInformer, when informers ultimately share their underlying threadSafeStore anyway?

	c.InformerFactory = informers.NewSharedInformerFactory(client, 0)
	c.PodInformer = factory.NewPodInformer(client, 0)

Looking inside NewPodInformer, we see that the pod informer is created with an extra selector: it only watches pods that are not Succeeded and not Failed, so its listWatch function differs from the other informers'.
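
As a hedged sketch of that selector (paraphrasing the 1.15 factory code, so treat the exact call shape as an approximation rather than the real NewPodInformer), the list-watch is restricted with a field selector that excludes Succeeded and Failed pods:

	package sketch

	import (
		v1 "k8s.io/api/core/v1"
		metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
		"k8s.io/apimachinery/pkg/fields"
		"k8s.io/client-go/kubernetes"
		"k8s.io/client-go/tools/cache"
	)

	// newSchedulerPodListWatch is an illustrative helper: it builds a ListWatch
	// that only sees pods whose phase is neither Succeeded nor Failed, which is
	// the effect the scheduler's dedicated pod informer achieves.
	func newSchedulerPodListWatch(client kubernetes.Interface) *cache.ListWatch {
		selector := fields.ParseSelectorOrDie(
			"status.phase!=" + string(v1.PodSucceeded) +
				",status.phase!=" + string(v1.PodFailed))
		return cache.NewListWatchFromClient(
			client.CoreV1().RESTClient(), "pods", metav1.NamespaceAll, selector)
	}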

Now we jump straight to Run(). This function is the protagonist of this part and consists of the following steps:

  1. Create a new scheduler object, which includes creating the various informers
  2. Initialize event broadcasting and health checks
  3. Start all the informers
  4. Wait for the informer caches to sync with the apiserver
  5. Start the scheduler

One small detail to keep in mind: an explicitly created informer instance needs to be run in its own goroutine, whereas the informerFactory can simply be Start()ed. Let's see what sched looks like:

	sched, err := scheduler.New(cc.Client,
		cc.InformerFactory.Core().V1().Nodes(),
		cc.PodInformer,
		cc.InformerFactory.Core().V1().PersistentVolumes(),
		cc.InformerFactory.Core().V1().PersistentVolumeClaims(),
		cc.InformerFactory.Core().V1().ReplicationControllers(),
		cc.InformerFactory.Apps().V1().ReplicaSets(),
		cc.InformerFactory.Apps().V1().StatefulSets(),
		cc.InformerFactory.Core().V1().Services(),
		cc.InformerFactory.Policy().V1beta1().PodDisruptionBudgets(),
		cc.InformerFactory.Storage().V1().StorageClasses(),
		cc.Recorder,
		cc.ComponentConfig.AlgorithmSource,
		stopCh,
		framework.NewRegistry(),
		cc.ComponentConfig.Plugins,
		cc.ComponentConfig.PluginConfig,
		scheduler.WithName(cc.ComponentConfig.SchedulerName),
		scheduler.WithHardPodAffinitySymmetricWeight(cc.ComponentConfig.HardPodAffinitySymmetricWeight),
		scheduler.WithPreemptionDisabled(cc.ComponentConfig.DisablePreemption),
		scheduler.WithPercentageOfNodesToScore(cc.ComponentConfig.PercentageOfNodesToScore),
		scheduler.WithBindTimeoutSeconds(*cc.ComponentConfig.BindTimeoutSeconds))

A long list of constructor arguments, falling into a few groups:

  • informers
  • recorder, used to record events
  • algorithm source: the default is "DefaultProvider"
  • framework: this is very important; more on it later
  • plugins and pluginConfig: these correspond to the scheduler plugin enhancement mentioned earlier
  • options: mostly helper functions. The one worth noting is PercentageOfNodesToScore. Its motivation: as the cluster grows, the scheduler's per-round latency increases linearly, so to speed up filtering, the filter phase stops as soon as a given percentage of nodes have been found feasible (see the sketch below).
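
A self-contained sketch of that early-exit idea (simplified: the constant and the handling of a zero or out-of-range percentage differ from the real numFeasibleNodesToFind, so read it as an approximation):

	package sketch

	// numFeasibleNodesToFind returns how many feasible nodes the filter phase
	// should look for before stopping early. minFeasibleNodes is an assumed
	// floor; upstream uses a similar constant plus an adaptive percentage.
	func numFeasibleNodesToFind(totalNodes, percentage int) int {
		const minFeasibleNodes = 100
		if totalNodes < minFeasibleNodes || percentage <= 0 || percentage >= 100 {
			return totalNodes
		}
		n := totalNodes * percentage / 100
		if n < minFeasibleNodes {
			return minFeasibleNodes
		}
		return n
	}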

Next, a piece of code involving the context package:

	// Prepare a reusable runCommand function.
	run := func(ctx context.Context) {
		sched.Run()
		<-ctx.Done()
	}

	ctx, cancel := context.WithCancel(context.TODO()) // TODO once Run() accepts a context, it should be used here
	defer cancel()

	go func() {
		select {
		case <-stopCh:
			cancel()
		case <-ctx.Done():
		}
	}()

Look at the run function first: it calls sched.Run(), which keeps running scheduleOne in a goroutine until the stop channel fires. Meanwhile, <-ctx.Done() blocks run until ctx.Done() becomes readable, at which point run returns. Once run() returns, the scheduler stops working and the post-processing defined in cobra can take over. So the key lies in the context logic: ctx, cancel := context.WithCancel(context.TODO()). According to the comment:

WithCancel returns a copy of parent with a new Done channel.

we learn that the parent context is context.TODO(). Looking at TODO's documentation,

TODO returns a non-nil, empty Context. Code should use context.TODO when it's unclear which Context to use or it is not yet available (because the surrounding function has not yet been extended to accept a Context parameter).

combined with the run := assignment above, this makes sense: run has just been defined and does not yet actually receive a context parameter, so a TODO context is created for now. Back to WithCancel:

The returned context's Done channel is closed when the returned cancel function is called or when the parent context's Done channel is closed, whichever happens first.

Canceling this context releases resources associated with it, so code should call cancel as soon as the operations running in this Context complete.

The returned ctx is a child of that context.TODO(). Then, in a goroutine, a select waits: if stopCh becomes readable, cancel() is called. That cancel function is returned by WithCancel(); its job is to cancel the context and remove all of its children. Once ctx.Done() becomes readable, run() finishes and the scheduler stops with it. Looking at how cancel() works, in func (c *cancelCtx) cancel the key code is:

	if c.done == nil {
		c.done = closedchan
	} else {
		close(c.done)
	}

What is c.done? Looking at the details of Done(): if c.done is non-nil, it is returned. So we can roughly say that cancel() closes the channel that Done() returns. After cancel() is called, receiving from a closed channel returns immediately with the zero value instead of blocking, so <-ctx.Done() unblocks and run() ends as well. Until then, run(ctx) keeps executing sched.scheduleOne, round after round of scheduling.
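
A tiny runnable example of that channel behavior, since it is the crux of why <-ctx.Done() unblocks:

	package main

	import "fmt"

	func main() {
		done := make(chan struct{})
		close(done)
		// Receiving from a closed channel does not block and does not panic:
		// it returns the zero value immediately, with ok == false.
		v, ok := <-done
		fmt.Println(v, ok) // prints: {} false
	}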

To sum up what we know so far about how the scheduler works: it obtains a series of resources through informers, including all running and pending pods, and then keeps executing scheduling rounds back to back (with no idle time between rounds). Next, let's see what the scheduler actually looks like.

A look at each part of the scheduler

New() initializes the scheduler struct type Scheduler, which contains only a config *factory.Config; this makes customizing and extending the scheduler convenient. Within the config, the parts we care about are cache, algorithm, framework, and schedulingQueue.

// Config is an implementation of the Scheduler's configured input data.
// TODO over time we should make this struct a hidden implementation detail of the scheduler.
type Config struct {
	// It is expected that changes made via SchedulerCache will be observed
	// by NodeLister and Algorithm.
	SchedulerCache internalcache.Cache

	NodeLister algorithm.NodeLister
	Algorithm  core.ScheduleAlgorithm
	GetBinder  func(pod *v1.Pod) Binder
	// PodConditionUpdater is used only in case of scheduling errors. If we succeed
	// with scheduling, PodScheduled condition will be updated in apiserver in /bind
	// handler so that binding and setting PodCondition it is atomic.
	PodConditionUpdater PodConditionUpdater
	// PodPreemptor is used to evict pods and update 'NominatedNode' field of
	// the preemptor pod.
	PodPreemptor PodPreemptor
	// Framework runs scheduler plugins at configured extension points.
	Framework framework.Framework

	// NextPod should be a function that blocks until the next pod
	// is available. We don't use a channel for this, because scheduling
	// a pod may take some amount of time and we don't want pods to get
	// stale while they sit in a channel.
	NextPod func() *v1.Pod

	// WaitForCacheSync waits for scheduler cache to populate.
	// It returns true if it was successful, false if the controller should shutdown.
	WaitForCacheSync func() bool

	// Error is called if there is an error. It is passed the pod in
	// question, and the error
	Error func(*v1.Pod, error)

	// Recorder is the EventRecorder to use
	Recorder record.EventRecorder

	// Close this to shut down the scheduler.
	StopEverything <-chan struct{}

	// VolumeBinder handles PVC/PV binding for the pod.
	VolumeBinder *volumebinder.VolumeBinder

	// Disable pod preemption or not.
	DisablePreemption bool

	// SchedulingQueue holds pods to be scheduled
	SchedulingQueue internalqueue.SchedulingQueue
}

Back in New(), skipping the default-value plumbing, the first thing we see is a configurator being created, fed with New()'s own arguments. Inside it, the scheduler's cache, framework, and queue are initialized; think of it as the config plus a more concrete runtime setup.

Afterwards, depending on the algorithm provider or policy, execution enters CreateFromProvider or CreateFromConfig. Note that both ultimately call CreateFromKeys, which finally returns a Config. It is easy to overlook the Error field here: it looks like a mere error-handling callback, but it actually plays an important role in the scheduling machinery. We will analyze this Error field separately later.

	return &Config{
		SchedulerCache: c.schedulerCache,
		// The scheduler only needs to consider schedulable nodes.
		NodeLister:          &nodeLister{c.nodeLister},
		Algorithm:           algo,
		GetBinder:           getBinderFunc(c.client, extenders),
		PodConditionUpdater: &podConditionUpdater{c.client},
		PodPreemptor:        &podPreemptor{c.client},
		Framework:           c.framework,
		WaitForCacheSync: func() bool {
			return cache.WaitForCacheSync(c.StopEverything, c.scheduledPodsHasSynced)
		},
		NextPod:         internalqueue.MakeNextPodFunc(c.podQueue),
		Error:           MakeDefaultErrorFunc(c.client, c.podQueue, c.schedulerCache, c.StopEverything),
		StopEverything:  c.StopEverything,
		VolumeBinder:    c.volumeBinder,
		SchedulingQueue: c.podQueue,
	}, nil

Once this config is obtained, sched := NewFromConfig(config) creates the scheduler instance. Of course, the callbacks still have to be registered, see AddAllEventHandlers(sched, options.schedulerName, nodeInformer, podInformer, pvInformer, pvcInformer, serviceInformer, storageClassInformer), which wires the informers to the schedulerCache.

cache

Before diving into internalcache.New(30*time.Second, stopEverything), note a peculiar Go idiom first: stopEverything = wait.NeverStop, where var NeverStop <-chan struct{} = make(chan struct{}). Receiving from a channel that is never written to or closed blocks forever, hence, quite literally, never stop. New does two things: first newSchedulerCache(30s, 1s, stop), which sets 30s as the expiration TTL for assumed pods, then cache.run().

New returns a Cache implementation. It automatically starts a go routine that manages expiration of assumed pods. "ttl" is how long the assumed pod will get expired.

So, what is an assumed pod? The short answer: an assumed pod is the pod's state before the bind actually happens. It is asynchronous with the bind operation, so the scheduler does not have to wait for the slow bind call to return, which improves scheduler throughput.

The cache contains nodes and a nodeTree. nodes is a doubly linked list, each element holding node information, pod information, port usage, resource usage and so on. nodeTree is a tree whose keys are zone names and whose values are the node names in each zone. If zones are unfamiliar, see multiple-zones. In short, nodeTree relates to placement policy across zones and does not really affect our understanding of the scheduling mechanism.

cache.run() keeps invoking cleanupExpiredAssumedPods every second. It checks whether each assumed pod in the cache has finished binding: those that haven't are left alone; those that have finished binding and have passed their deadline are removed from the cache, and the cache state is updated accordingly.
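
A simplified, self-contained model of that cleanup pass (the field and function names here are illustrative, not the real cache internals):

	package sketch

	import "time"

	// assumedPodState mimics the bits of the scheduler cache's pod state that
	// matter for expiry: whether the bind finished and when the TTL runs out.
	type assumedPodState struct {
		bindingFinished bool
		deadline        time.Time
	}

	// cleanupExpired drops assumed pods whose binding has finished but whose
	// TTL deadline has passed; in-flight bindings are left untouched.
	func cleanupExpired(assumed map[string]*assumedPodState, now time.Time) {
		for key, ps := range assumed {
			if !ps.bindingFinished {
				continue
			}
			if now.After(ps.deadline) {
				delete(assumed, key)
			}
		}
	}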

framework

framework.NewFramework(args.Registry, args.Plugins, args.PluginConfig) essentially ports kube-batch's logic into the native scheduler: gang scheduling via waitingPods, and multi-tenancy support via plugins.QueueSort.

Whether this kind of extension will be widely adopted by the community remains to be seen.

queue

internalqueue.NewSchedulingQueue(stopEverything, framework) creates a new scheduling queue. We are ignoring framework for now, so treat it as nil. The schedulingQueue is in fact a PriorityQueue:

// PriorityQueue implements a scheduling queue.
// The head of PriorityQueue is the highest priority pending pod. This structure
// has three sub queues. One sub-queue holds pods that are being considered for
// scheduling. This is called activeQ and is a Heap. Another queue holds
// pods that are already tried and are determined to be unschedulable. The latter
// is called unschedulableQ. The third queue holds pods that are moved from
// unschedulable queues and will be moved to active queue when backoff are completed.
type PriorityQueue struct {
	stop  <-chan struct{}
	clock util.Clock
	// podBackoff tracks backoff for pods attempting to be rescheduled
	podBackoff *PodBackoffMap

	lock sync.RWMutex
	cond sync.Cond

	// activeQ is heap structure that scheduler actively looks at to find pods to
	// schedule. Head of heap is the highest priority pod.
	activeQ *util.Heap
	// podBackoffQ is a heap ordered by backoff expiry. Pods which have completed backoff
	// are popped from this heap before the scheduler looks at activeQ
	podBackoffQ *util.Heap
	// unschedulableQ holds pods that have been tried and determined unschedulable.
	unschedulableQ *UnschedulablePodsMap
	// nominatedPods is a structures that stores pods which are nominated to run
	// on nodes.
	nominatedPods *nominatedPodMap
	// schedulingCycle represents sequence number of scheduling cycle and is incremented
	// when a pod is popped.
	schedulingCycle int64
	// moveRequestCycle caches the sequence number of scheduling cycle when we
	// received a move request. Unscheduable pods in and before this scheduling
	// cycle will be put back to activeQueue if we were trying to schedule them
	// when we received move request.
	moveRequestCycle int64

	// closed indicates that the queue is closed.
	// It is mainly used to let Pop() exit its control loop while waiting for an item.
	closed bool
}

The PriorityQueue is implemented on top of a heap; pkg/scheduler/util/heap.go provides the heap implementation. The queue holds three sub-queues:

  • activeQ: pods waiting to be scheduled
  • podBackoffQ: if a pod was tried and failed to schedule, it enters a backoff state. This resembles unschedulable, but backoff lets the pod retry after a delay that grows with the number of retries, a bit like the workqueue mechanism used in controllers. backoffQ prevents a high-priority but unschedulable pod from retrying endlessly and starving other pods.
  • unschedulableQ: pods that were tried and judged unschedulable

Note that activeQ is a heap, while unschedulableQ is an UnschedulablePodsMap:

// UnschedulablePodsMap holds pods that cannot be scheduled. This data structure
// is used to implement unschedulableQ.
type UnschedulablePodsMap struct {
	// podInfoMap is a map key by a pod's full-name and the value is a pointer to the PodInfo.
	podInfoMap map[string]*framework.PodInfo
	keyFunc    func(*v1.Pod) string
	// metricRecorder updates the counter when elements of an unschedulablePodsMap
	// get added or removed, and it does nothing if it's nil
	metricRecorder metrics.MetricRecorder
}

NewPriorityQueue中,能看到这样的调用unschedulableQ: newUnschedulablePodsMap(metrics.NewUnschedulablePodsRecorder())以及pq.podBackoffQ = util.NewHeapWithRecorder(podInfoKeyFunc, pq.podsCompareBackoffCompleted, metrics.NewBackoffPodsRecorder())。这两个用法都和prometheus有关系,可以用prometheus记录这两个队列的变化情况。

Within a queue, pods are compared as follows: priority comes first; when priorities are equal, the earlier timestamp ranks higher. As for how the queues relate to each other: scheduling pops pods from activeQ, and backed-off and unschedulable pods are moved back into activeQ at certain moments, as seen in the snippet below (a sketch of the comparator follows it):

func (p *PriorityQueue) run() {
	go wait.Until(p.flushBackoffQCompleted, 1.0*time.Second, p.stop)
	go wait.Until(p.flushUnschedulableQLeftover, 30*time.Second, p.stop)
}
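
A hedged sketch of the in-queue ordering just described (the real comparator is wired up in NewPriorityQueue; the queuedPod type here is a stand-in for the real queued pod info):

	package sketch

	import "time"

	// queuedPod is a stand-in for the queued pod info: its priority and the
	// timestamp used for tie-breaking.
	type queuedPod struct {
		priority  int32
		timestamp time.Time
	}

	// less orders the heap: higher priority first, earlier timestamp on a tie.
	func less(a, b queuedPod) bool {
		if a.priority != b.priority {
			return a.priority > b.priority
		}
		return a.timestamp.Before(b.timestamp)
	}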

The key methods implemented by the queue are:

  • Add: add the pod to activeQ, check whether it is already in unschedulableQ, and if it is in backoffQ, delete it from backoffQ; then add it to nominatedPods. This function is only meant for newly arriving pods.
  • AddIfNotPresent: add the pod to activeQ only if it is not in any queue; otherwise report an error and do nothing.
  • AddUnschedulableIfNotPresent: add an unschedulable pod that is not in any queue to unschedulableQ
  • Pop: pop the head of activeQ, blocking until activeQ is non-empty; after popping, schedulingCycle is incremented by 1
  • flushBackoffQCompleted: move every pod in backoffQ whose backoff period for this round has expired into activeQ
  • flushUnschedulableQLeftover: move every pod that has stayed in unschedulableQ longer than a threshold back into activeQ; the threshold is hard-coded to 60 seconds
  • Update: called when a pod is updated; if the pod is in activeQ or backoffQ, update it in place; if it is in unschedulableQ, move it into activeQ

There is a tricky point here: if we read through all of the happy-path logic, we will find that unschedulableQ and backoffQ are never used anywhere. So where do these two queues come into play? The answer is in the error handling. After walking through the happy path, we will look at the error handling in detail.

For now, let's focus on another question: where does the data in activeQ come from? As mentioned earlier, the scheduler registers callbacks for each informer at initialization time, in pkg/scheduler/eventhandlers.go. Here we focus only on the podInformer; the rest will be revisited later.

	// scheduled pod cache
	podInformer.Informer().AddEventHandler(
		cache.FilteringResourceEventHandler{
			FilterFunc: func(obj interface{}) bool {
				switch t := obj.(type) {
				case *v1.Pod:
					return assignedPod(t)
				case cache.DeletedFinalStateUnknown:
					if pod, ok := t.Obj.(*v1.Pod); ok {
						return assignedPod(pod)
					}
					utilruntime.HandleError(fmt.Errorf("unable to convert object %T to *v1.Pod in %T", obj, sched))
					return false
				default:
					utilruntime.HandleError(fmt.Errorf("unable to handle object in %T: %T", sched, obj))
					return false
				}
			},
			Handler: cache.ResourceEventHandlerFuncs{
				AddFunc:    sched.addPodToCache,
				UpdateFunc: sched.updatePodInCache,
				DeleteFunc: sched.deletePodFromCache,
			},
		},
	)
	// unscheduled pod queue
	podInformer.Informer().AddEventHandler(
		cache.FilteringResourceEventHandler{
			FilterFunc: func(obj interface{}) bool {
				switch t := obj.(type) {
				case *v1.Pod:
					return !assignedPod(t) && responsibleForPod(t, schedulerName)
				case cache.DeletedFinalStateUnknown:
					if pod, ok := t.Obj.(*v1.Pod); ok {
						return !assignedPod(pod) && responsibleForPod(pod, schedulerName)
					}
					utilruntime.HandleError(fmt.Errorf("unable to convert object %T to *v1.Pod in %T", obj, sched))
					return false
				default:
					utilruntime.HandleError(fmt.Errorf("unable to handle object in %T: %T", sched, obj))
					return false
				}
			},
			Handler: cache.ResourceEventHandlerFuncs{
				AddFunc:    sched.addPodToSchedulingQueue,
				UpdateFunc: sched.updatePodInSchedulingQueue,
				DeleteFunc: sched.deletePodFromSchedulingQueue,
			},
		},
	)

Pods are split into two classes here. One class is pods that have already been scheduled; constrained by the v1 podPhase, we can only lump them together as running (even if they are Terminating). These are synchronized directly into the schedulerCache: for example, AddFunc calls addPodToCache, which in turn calls sched.config.SchedulerCache.AddPod(pod). Pods that have not been scheduled, e.g. in the pending state, are synchronized with the schedulingQueue instead, e.g. via sched.config.SchedulingQueue.Add.

A small question to ponder (already answered earlier): where did the succeeded and failed pods go?

algorithm

Both provider and policy are about the algorithm. The provider is what matters to us: in createFromProvider() we can see that it loads the predicate functions, priority functions, and so on. Their concrete defaults live in pkg/scheduler/algorithmprovider/defaults/defaults.go. The default predicates are:

  • NoVolumeZoneConflictPred
  • MaxEBSVolumeCountPred
  • MaxGCEPDVolumeCountPred
  • MaxAzureDiskVolumeCountPred
  • MaxCSIVolumeCountPred
  • MatchInterPodAffinityPred
  • NoDiskConflictPred
  • GeneralPred
  • CheckNodeMemoryPressurePred
  • CheckNodeDiskPressurePred
  • CheckNodePIDPressurePred
  • CheckNodeConditionPred
  • PodToleratesNodeTaintsPred
  • CheckVolumeBindingPred

The default priority functions are:

  • SelectorSpreadPriority
  • InterPodAffinityPriority
  • LeastRequestedPriority
  • BalancedResourceAllocation
  • NodePreferAvoidPodsPriority
  • NodeAffinityPriority
  • TaintTolerationPriority
  • ImageLocalityPriority
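
The call shape of a predicate can be read off a snippet further below (fit, reasons, err = predicate(pod, metaToUse, nodeInfoToUse)). Here is a toy predicate written against stand-in types, purely to illustrate that shape; it is not one of the real predicates, and the real signature uses *v1.Pod, predicates.PredicateMetadata and *schedulernodeinfo.NodeInfo.

	package sketch

	import "errors"

	// Stand-in types for the illustration only.
	type podStub struct{ requestCPU int }
	type nodeInfoStub struct{ allocatableCPU, requestedCPU int }
	type failureReason string

	// cpuFitsPredicate reports whether the pod's CPU request fits on the node,
	// returning (fit, reasons, error) in the same shape as the real predicates.
	func cpuFitsPredicate(pod podStub, node nodeInfoStub) (bool, []failureReason, error) {
		if node.allocatableCPU <= 0 {
			return false, nil, errors.New("node has no allocatable CPU recorded")
		}
		if node.requestedCPU+pod.requestCPU > node.allocatableCPU {
			return false, []failureReason{"Insufficient cpu"}, nil
		}
		return true, nil, nil
	}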

The scheduling flow, without error handling

First, let's set error handling aside and look at the simplest flow. Scheduling proceeds round by round, and each round calls scheduleOne(). Ignoring the framework, it first pops a pod from activeQ; if the pod's DeletionTimestamp is non-nil it returns immediately, because that pod has already been marked for deletion. It then calls sched.schedule to simulate placing the pod. If the simulation fails, it attempts preemption via sched.preempt. If it succeeds, the pod is marked assumed and the reserve step runs. Note that reserve here is defined differently from reserve in kube-batch:

resources on a node are being reserved for a given Pod. This happens before the scheduler actually binds the pod to the Node, and it exists to prevent race conditions while the scheduler waits for the bind to succeed.

That said, no plugins implement this yet, so we can ignore reserve for now. After assume, the asynchronous bind is attempted. Ignoring all plugin-related code, the core is sched.bind(). Binding means actually handing the pod over to the node's kubelet for initialization. Once the bind result comes back, the cache is notified that the pod's binding has finished; if the bind failed, the pod is removed from the cache.

schedule

It starts with some basic checks, podPassesBasicChecks, which really check that if the pod uses a PVC, the PVC exists and is not being deleted.

Then a node list is taken: nodes, err := nodeLister.List(). This reads from the nodeInformer's cache, which is different from the scheduler cache: the informer cache stays in sync with etcd via list-watch, while the scheduler cache is maintained by hand. After fetching all nodes, g.snapshot() is called; it walks all nodes in the scheduler cache and refreshes their information. Why not operate directly on the informer's cache? First, the informer cache is tied to etcd; modifying it directly would make it hard to reason about cluster-wide node state. Second, because bind is asynchronous, we do not know a node's final state before the bind result returns, so we have to record it in the scheduler cache to keep the scheduler running correctly; the real node information catches up via the informer.

Then comes filtering, g.findNodesThatFit(pod, nodes). The overall strategy is to apply the previously registered predicate functions to each node and keep the schedulable ones. workqueue.ParallelizeUntil(ctx, 16, int(allNodes), checkNode) fans the work out across 16 goroutines, i.e. checkNode runs once per node. Let's look at the helper carefully:

// ParallelizeUntil is a framework that allows for parallelizing N
// independent pieces of work until done or the context is canceled.
func ParallelizeUntil(ctx context.Context, workers, pieces int, doWorkPiece DoWorkPieceFunc) {
	var stop <-chan struct{}
	if ctx != nil {
		stop = ctx.Done()
	}

	toProcess := make(chan int, pieces)
	for i := 0; i < pieces; i++ {
		toProcess <- i
	}
	close(toProcess)

	if pieces < workers {
		workers = pieces
	}

	wg := sync.WaitGroup{}
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer utilruntime.HandleCrash()
			defer wg.Done()
			for piece := range toProcess {
				select {
				case <-stop:
					return
				default:
					doWorkPiece(piece)
				}
			}
		}()
	}
	wg.Wait()
}

stop is defined as ctx.Done(); this ctx is created one level up as ctx, cancel := context.WithCancel(context.Background()). The context package was already covered in the informer mechanism article, so I won't repeat it. One question is worth answering: why add a context at all? As mentioned earlier, to speed up filtering it is enough to find a certain percentage of feasible nodes; inside checkNode, once it determines that enough nodes have been found, it calls cancel(), and upon receiving from stopCh the workers return. Inside ParallelizeUntil, after setting up stopCh, it initializes toProcess, a channel whose buffer length equals the number of pieces (nodes), and fills it up. It then creates a WaitGroup and waits for the 16 workers to drain all the work or receive the stop signal. Each worker runs the checkNode function. Inside checkNode, let's take a closer look at podFitsOnNode:

	for i := 0; i < 2; i++ {
		metaToUse := meta
		nodeInfoToUse := info
		if i == 0 {
			podsAdded, metaToUse, nodeInfoToUse = addNominatedPods(pod, meta, info, queue)
		} else if !podsAdded || len(failedPredicates) != 0 {
			break
		}
		for _, predicateKey := range predicates.Ordering() {
			var (
				fit     bool
				reasons []predicates.PredicateFailureReason
				err     error
			)
			//TODO (yastij) : compute average predicate restrictiveness to export it as Prometheus metric
			if predicate, exist := predicateFuncs[predicateKey]; exist {
				fit, reasons, err = predicate(pod, metaToUse, nodeInfoToUse)
				if err != nil {
					return false, []predicates.PredicateFailureReason{}, err
				}

				if !fit {
					// eCache is available and valid, and predicates result is unfit, record the fail reasons
					failedPredicates = append(failedPredicates, reasons...)
					// if alwaysCheckAllPredicates is false, short circuit all predicates when one predicate fails.
					if !alwaysCheckAllPredicates {
						klog.V(5).Infoln("since alwaysCheckAllPredicates has not been set, the predicate " +
							"evaluation is short circuited and there are chances " +
							"of other predicates failing as well.")
						break
					}
				}
			}
		}
	}

Curiously, the loop runs twice. The difference is that the first pass adds the nominated pods; a nominated pod is one that has been designated to run on a particular node, which, more concretely, relates to preemption. The comment explains it well:

We run predicates twice in some cases. If the node has greater or equal priority nominated pods, we run them when those pods are added to meta and nodeInfo. If all predicates succeed in this pass, we run them again when these nominated pods are not added. This second pass is necessary because some predicates such as inter-pod affinity may not pass without the nominated pods. If there are no nominated pods for the node or if the first run of the predicates fail, we don't run the second pass.

The predicate functions are also ordered so that the cheaper ones run first, reducing filtering latency.

After filtering comes scoring, PrioritizeNodes(), in pkg/scheduler/core/generic_scheduler.go. In short, when scoring for a pod, for node n and priority strategy k a score(n, k) is computed; the weighted sum of these scores is the node's total score after applying all priority strategies.
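
In code form the total is just a weighted sum; this toy helper (with illustrative names) computes it for one node:

	package sketch

	// totalScore computes a node's final score: for node n,
	// final(n) = Σ_k weight[k] * score[k][n], where score[k][n] is the score the
	// k-th priority strategy gave to node n.
	func totalScore(scores [][]int, weights []int, nodeIdx int) int {
		sum := 0
		for k, w := range weights {
			sum += w * scores[k][nodeIdx]
		}
		return sum
	}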

The computation follows a map-reduce pattern: the map step is results[i][index], err = priorityConfigs[i].Map(pod, meta, nodeInfo), and the reduce step is priorityConfigs[index].Reduce(pod, meta, nodeNameToInfo, results[index]). The Map and Reduce functions are specified when the priority functions are registered; the defaults are in pkg/scheduler/algorithmprovider/defaults/register_priorities.go. Some strategies define a Map/Reduce pair, others only a Function, which is why the code also contains calls of the form priorityConfigs[index].Function. I won't paste that source here; the logic is straightforward.

After scoring, one node is picked for the subsequent bind. The current strategy in host, err := g.selectHost(priorityList) is very simple: pick the node with the highest score. If multiple nodes tie for the highest score, Kubernetes defines a selection rule, which is fairly trivial and not worth expanding on here.

preempt

Preemption can be turned off via the DisablePreemption option; here we take a quick look at the mechanism. The entry point is preempt(). When preemption is allowed:

  1. First, fetch the latest data for the pod that failed scheduling.
  2. Run sched.config.Algorithm.Preempt(preemptor, sched.config.NodeLister, scheduleErr) to get the node to preempt on and the list of pods to evict, called victims.
  3. Update the pod's nominatedNodeName in etcd.
  4. Delete all the pods in victims.

At this point, because the pods in etcd have changed, the informer's event handlers put the pod back into the activeQueue and it gets rescheduled. Note that although preemption evicted some pods and recorded a nominated node, the pod is not guaranteed to land on that nominated node; the final placement still depends on the scoring results.

assume

Assuming a pod means setting its nodeName to the node chosen by filtering, scoring, and host selection, and adding it to the cache:

func (cache *schedulerCache) AssumePod(pod *v1.Pod) error {
	key, err := schedulernodeinfo.GetPodKey(pod)
	if err != nil {
		return err
	}

	cache.mu.Lock()
	defer cache.mu.Unlock()
	if _, ok := cache.podStates[key]; ok {
		return fmt.Errorf("pod %v is in the cache, so can't be assumed", key)
	}

	cache.addPod(pod)
	ps := &podState{
		pod: pod,
	}
	cache.podStates[key] = ps
	cache.assumedPods[key] = true
	return nil
}

Once the pod is assumed, the bind proceeds asynchronously.

bind

Skipping a pile of plugin code, we land on the bind logic:

		err := sched.bind(assumedPod, &v1.Binding{
			ObjectMeta: metav1.ObjectMeta{Namespace: assumedPod.Namespace, Name: assumedPod.Name, UID: assumedPod.UID},
			Target: v1.ObjectReference{
				Kind: "Node",
				Name: scheduleResult.SuggestedHost,
			},
		})

The bind itself happens via err := sched.config.GetBinder(assumed).Bind(b). If you recall the scheduler initialization, there was a line GetBinder: getBinderFunc(c.client, extenders), and inside that function we find defaultBinder := &binder{client}. In other words, by default GetBinder just wraps a clientset, which sends the bind request to the apiserver; the apiserver handles it and forwards it to the node's kubelet, which performs the remaining work and reports the result. If the bind fails, sched.config.SchedulerCache.ForgetPod(assumed) is executed; forget removes the pod from the cache.
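
Roughly, the default binder boils down to a single clientset call that POSTs the Binding subresource; here is a hedged paraphrase (in 1.15-era client-go the pods client exposes Bind(binding); newer client-go versions add a context parameter):

	package sketch

	import (
		v1 "k8s.io/api/core/v1"
		metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
		"k8s.io/client-go/kubernetes"
	)

	// bindPod sends the Binding for an assumed pod to the apiserver; the kubelet
	// on the target node takes over from there.
	func bindPod(client kubernetes.Interface, assumed *v1.Pod, nodeName string) error {
		binding := &v1.Binding{
			ObjectMeta: metav1.ObjectMeta{Namespace: assumed.Namespace, Name: assumed.Name, UID: assumed.UID},
			Target:     v1.ObjectReference{Kind: "Node", Name: nodeName},
		}
		return client.CoreV1().Pods(binding.Namespace).Bind(binding)
	}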

Summary

Having walked the flow without error handling, we can summarize the default scheduler as: keep popping pods from activeQ, run filtering and scoring, take the highest-scoring node, and bind asynchronously. If the bind fails, drop the pod from the cache.

And with error handling?

By now we have noticed that unschedulableQ and backoffQ were never used at all, because we only traced the happy path and ignored error handling. If you recall the scheduler initialization, a MakeDefaultErrorFunc was registered there rather inconspicuously:

	return &Config{
		SchedulerCache: c.schedulerCache,
		// The scheduler only needs to consider schedulable nodes.
		NodeLister:          &nodeLister{c.nodeLister},
		Algorithm:           algo,
		GetBinder:           getBinderFunc(c.client, extenders),
		PodConditionUpdater: &podConditionUpdater{c.client},
		PodPreemptor:        &podPreemptor{c.client},
		Framework:           c.framework,
		WaitForCacheSync: func() bool {
			return cache.WaitForCacheSync(c.StopEverything, c.scheduledPodsHasSynced)
		},
		NextPod:         internalqueue.MakeNextPodFunc(c.podQueue),
		Error:           MakeDefaultErrorFunc(c.client, c.podQueue, c.schedulerCache, c.StopEverything),
		StopEverything:  c.StopEverything,
		VolumeBinder:    c.volumeBinder,
		SchedulingQueue: c.podQueue,
	}, nil

So where is this Error() used? recordSchedulingFailure, for instance, calls sched.config.Error(pod, err). This is genuinely easy to miss. And where is recordSchedulingFailure used? In assume, schedule, bind, and more! The naming deserves a complaint: it says record, but it quietly pushes the pod into backoffQ or unschedulableQ...

Find MakeDefaultErrorFunc in pkg/scheduler/factory/factory.go:

// MakeDefaultErrorFunc construct a function to handle pod scheduler error
func MakeDefaultErrorFunc(client clientset.Interface, podQueue internalqueue.SchedulingQueue, schedulerCache internalcache.Cache, stopEverything <-chan struct{}) func(pod *v1.Pod, err error) {
	return func(pod *v1.Pod, err error) {
		if err == core.ErrNoNodesAvailable {
			klog.V(4).Infof("Unable to schedule %v/%v: no nodes are registered to the cluster; waiting", pod.Namespace, pod.Name)
		} else {
			if _, ok := err.(*core.FitError); ok {
				klog.V(4).Infof("Unable to schedule %v/%v: no fit: %v; waiting", pod.Namespace, pod.Name, err)
			} else if errors.IsNotFound(err) {
				if errStatus, ok := err.(errors.APIStatus); ok && errStatus.Status().Details.Kind == "node" {
					nodeName := errStatus.Status().Details.Name
					// when node is not found, We do not remove the node right away. Trying again to get
					// the node and if the node is still not found, then remove it from the scheduler cache.
					_, err := client.CoreV1().Nodes().Get(nodeName, metav1.GetOptions{})
					if err != nil && errors.IsNotFound(err) {
						node := v1.Node{ObjectMeta: metav1.ObjectMeta{Name: nodeName}}
						schedulerCache.RemoveNode(&node)
					}
				}
			} else {
				klog.Errorf("Error scheduling %v/%v: %v; retrying", pod.Namespace, pod.Name, err)
			}
		}

		podSchedulingCycle := podQueue.SchedulingCycle()
		// Retry asynchronously.
		// Note that this is extremely rudimentary and we need a more real error handling path.
		go func() {
			defer runtime.HandleCrash()
			podID := types.NamespacedName{
				Namespace: pod.Namespace,
				Name:      pod.Name,
			}

			// An unschedulable pod will be placed in the unschedulable queue.
			// This ensures that if the pod is nominated to run on a node,
			// scheduler takes the pod into account when running predicates for the node.
			// Get the pod again; it may have changed/been scheduled already.
			getBackoff := initialGetBackoff
			for {
				pod, err := client.CoreV1().Pods(podID.Namespace).Get(podID.Name, metav1.GetOptions{})
				if err == nil {
					if len(pod.Spec.NodeName) == 0 {
						if err := podQueue.AddUnschedulableIfNotPresent(pod, podSchedulingCycle); err != nil {
							klog.Error(err)
						}
					}
					break
				}
				if errors.IsNotFound(err) {
					klog.Warningf("A pod %v no longer exists", podID)
					return
				}
				klog.Errorf("Error getting pod %v for retry: %v; retrying...", podID, err)
				if getBackoff = getBackoff * 2; getBackoff > maximalGetBackoff {
					getBackoff = maximalGetBackoff
				}
				time.Sleep(getBackoff)
			}
		}()
	}
}

Apart from a few node-related errors, everything else retries the pod asynchronously. In a new goroutine, it first checks whether the pod still exists; if not, it simply returns. If it exists but its nodeName is empty, AddUnschedulableIfNotPresent is called. This bit is a little deceptive: inside AddUnschedulableIfNotPresent, p.moveRequestCycle actually decides whether the pod goes into backoffQ or unschedulableQ.

moveRequestCycle is set to schedulingCycle when movePodsToActiveQueue or MoveAllToActiveQueue is called. Those are triggered from the informers' event handlers: when resources change, the scheduler assumes pods may have become schedulable again, so unschedulable and backed-off pods need to be moved back into activeQ. I won't go into further detail here; a sketch of the decision follows.
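
A hedged sketch of that decision (simplified from the 1.15 AddUnschedulableIfNotPresent logic; the callback parameters are stand-ins for the real queue operations):

	package sketch

	// enqueueFailedPod decides where a scheduling-failed pod goes: if a move
	// request arrived during or after the pod's scheduling cycle (something in
	// the cluster changed, so the pod may be schedulable again), it goes to
	// backoffQ and retries soon; otherwise it parks in unschedulableQ until the
	// next move request or the periodic 60s flush.
	func enqueueFailedPod(moveRequestCycle, podSchedulingCycle int64, addToBackoffQ, addToUnschedulableQ func()) {
		if moveRequestCycle >= podSchedulingCycle {
			addToBackoffQ()
		} else {
			addToUnschedulableQ()
		}
	}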