Introduction
The kube-scheduler component receives its command-line arguments, uses them to construct a Scheduler object, and finally starts the scheduler.
Once the scheduler is running, it can start scheduling Pods that have not yet been scheduled. This article analyzes how the scheduler schedules a single Pod.
Scheduling queue
After the scheduler starts, it eventually calls the Scheduler's Run() function to begin scheduling Pods, as shown below:
// pkg/scheduler/scheduler.go
// Run begins watching and scheduling. It starts scheduling and blocked until the context is done.
func (sched *Scheduler) Run(ctx context.Context) {
    sched.SchedulingQueue.Run()
    wait.UntilWithContext(ctx, sched.scheduleOne, 0)
    sched.SchedulingQueue.Close()
}
Run() first starts the SchedulingQueue's Run() function. SchedulingQueue is a queue interface used to store Pods waiting to be scheduled; the interface follows a pattern similar to cache.FIFO and cache.Heap. To understand how the scheduler schedules Pods, we first need to understand this structure:
// pkg/scheduler/internal/queue/scheduling_queue.go
// SchedulingQueue is an interface for a queue to store pods waiting to be scheduled.
// The interface follows a pattern similar to cache.FIFO and cache.Heap and
// makes it easy to use those data structures as a SchedulingQueue.
type SchedulingQueue interface {
    framework.PodNominator
    Add(pod *v1.Pod) error
    // Activate moves the given pods to activeQ iff they're in unschedulablePods or backoffQ.
    // The passed-in pods are originally compiled from plugins that want to activate Pods,
    // by injecting the pods through a reserved CycleState struct (PodsToActivate).
    Activate(pods map[string]*v1.Pod)
    // AddUnschedulableIfNotPresent adds an unschedulable pod back to scheduling queue.
    // The podSchedulingCycle represents the current scheduling cycle number which can be
    // returned by calling SchedulingCycle().
    AddUnschedulableIfNotPresent(pod *framework.QueuedPodInfo, podSchedulingCycle int64) error
    // SchedulingCycle returns the current number of scheduling cycle which is
    // cached by scheduling queue. Normally, incrementing this number whenever
    // a pod is popped (e.g. called Pop()) is enough.
    SchedulingCycle() int64
    // The following are generic queue operations.
    // Pop removes the head of the queue and returns it. It blocks if the
    // queue is empty and waits until a new item is added to the queue.
    Pop() (*framework.QueuedPodInfo, error)
    Update(oldPod, newPod *v1.Pod) error
    Delete(pod *v1.Pod) error
    MoveAllToActiveOrBackoffQueue(event framework.ClusterEvent, preCheck PreEnqueueCheck)
    AssignedPodAdded(pod *v1.Pod)
    AssignedPodUpdated(pod *v1.Pod)
    PendingPods() []*v1.Pod
    // Close closes the SchedulingQueue so that the goroutine which is
    // waiting to pop items can exit gracefully.
    Close()
    // Run starts the goroutines managing the queue.
    Run()
}
SchedulingQueue is the queue interface that stores Pods waiting to be scheduled. Looking at how the Scheduler object is constructed, we can see how the scheduler implements this queue interface:
// pkg/scheduler/internal/queue/scheduling_queue.go
// NewSchedulingQueue initializes a priority queue as a new scheduling queue.
func NewSchedulingQueue(
    lessFn framework.LessFunc,
    informerFactory informers.SharedInformerFactory,
    opts ...Option) SchedulingQueue {
    return NewPriorityQueue(lessFn, informerFactory, opts...)
}

// NewPriorityQueue creates a PriorityQueue object.
func NewPriorityQueue(
    lessFn framework.LessFunc,
    informerFactory informers.SharedInformerFactory,
    opts ...Option,
) *PriorityQueue {
    options := defaultPriorityQueueOptions
    for _, opt := range opts {
        opt(&options)
    }
    comp := func(podInfo1, podInfo2 interface{}) bool {
        pInfo1 := podInfo1.(*framework.QueuedPodInfo)
        pInfo2 := podInfo2.(*framework.QueuedPodInfo)
        return lessFn(pInfo1, pInfo2)
    }
    if options.podNominator == nil {
        options.podNominator = NewPodNominator(informerFactory.Core().V1().Pods().Lister())
    }
    pq := &PriorityQueue{
        PodNominator:                      options.podNominator,
        clock:                             options.clock,
        stop:                              make(chan struct{}),
        podInitialBackoffDuration:         options.podInitialBackoffDuration,
        podMaxBackoffDuration:             options.podMaxBackoffDuration,
        podMaxInUnschedulablePodsDuration: options.podMaxInUnschedulablePodsDuration,
        activeQ:                           heap.NewWithRecorder(podInfoKeyFunc, comp, metrics.NewActivePodsRecorder()),
        unschedulablePods:                 newUnschedulablePods(metrics.NewUnschedulablePodsRecorder()),
        moveRequestCycle:                  -1,
        clusterEventMap:                   options.clusterEventMap,
    }
    pq.cond.L = &pq.lock
    pq.podBackoffQ = heap.NewWithRecorder(podInfoKeyFunc, pq.podsCompareBackoffCompleted, metrics.NewBackoffPodsRecorder())
    pq.nsLister = informerFactory.Core().V1().Namespaces().Lister()
    return pq
}
From the initialization above we can see that PriorityQueue, a priority queue, implements the SchedulingQueue interface, so the real implementation lives in this priority queue:
// pkg/scheduler/internal/queue/scheduling_queue.go
// PriorityQueue implements a scheduling queue.
// The head of PriorityQueue is the highest priority pending pod. This structure
// has two sub queues and an additional data structure, namely: activeQ,
// backoffQ and unschedulablePods.
//   - activeQ holds pods that are being considered for scheduling.
//   - backoffQ holds pods that moved from unschedulablePods and will move to
//     activeQ when their backoff periods complete.
//   - unschedulablePods holds pods that were already attempted for scheduling and
//     are currently determined to be unschedulable.
type PriorityQueue struct {
    // PodNominator abstracts the operations to maintain nominated Pods.
    framework.PodNominator

    stop  chan struct{}
    clock util.Clock

    // pod initial backoff duration.
    podInitialBackoffDuration time.Duration
    // pod maximum backoff duration.
    podMaxBackoffDuration time.Duration
    // the maximum time a pod can stay in the unschedulablePods.
    podMaxInUnschedulablePodsDuration time.Duration

    lock sync.RWMutex
    cond sync.Cond

    // activeQ is heap structure that scheduler actively looks at to find pods to
    // schedule. Head of heap is the highest priority pod.
    activeQ *heap.Heap
    // podBackoffQ is a heap ordered by backoff expiry. Pods which have completed backoff
    // are popped from this heap before the scheduler looks at activeQ
    podBackoffQ *heap.Heap
    // unschedulablePods holds pods that have been tried and determined unschedulable.
    unschedulablePods *UnschedulablePods
    // schedulingCycle represents sequence number of scheduling cycle and is incremented
    // when a pod is popped.
    schedulingCycle int64
    // moveRequestCycle caches the sequence number of scheduling cycle when we
    // received a move request. Unschedulable pods in and before this scheduling
    // cycle will be put back to activeQueue if we were trying to schedule them
    // when we received move request.
    moveRequestCycle int64

    clusterEventMap map[framework.ClusterEvent]sets.String

    // closed indicates that the queue is closed.
    // It is mainly used to let Pop() exit its control loop while waiting for an item.
    closed bool

    nsLister listersv1.NamespaceLister
}
The scheduler uses this PriorityQueue to store Pods waiting to be scheduled. A plain queue is a FIFO data structure where elements are dequeued in the order they were enqueued; for scheduling, a priority queue is clearly a better fit, because it lets the scheduler pick the next Pod according to a priority policy.
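To make the idea concrete, here is a minimal, self-contained sketch (my own illustration, not the scheduler's actual heap code) of a priority queue built on Go's container/heap: items pop out ordered by a priority field rather than by insertion order.

// A toy priority queue: the element with the highest priority pops first.
package main

import (
    "container/heap"
    "fmt"
)

type podItem struct {
    name     string
    priority int32
}

type podHeap []podItem

func (h podHeap) Len() int            { return len(h) }
func (h podHeap) Less(i, j int) bool  { return h[i].priority > h[j].priority } // higher priority first
func (h podHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *podHeap) Push(x interface{}) { *h = append(*h, x.(podItem)) }
func (h *podHeap) Pop() interface{} {
    old := *h
    item := old[len(old)-1]
    *h = old[:len(old)-1]
    return item
}

func main() {
    h := &podHeap{}
    heap.Push(h, podItem{name: "batch-job", priority: 0})
    heap.Push(h, podItem{name: "critical-pod", priority: 1000})
    heap.Push(h, podItem{name: "web-app", priority: 100})
    // Pods come out in priority order, not insertion order.
    for h.Len() > 0 {
        fmt.Println(heap.Pop(h).(podItem).name) // critical-pod, web-app, batch-job
    }
}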
The head of the PriorityQueue is the highest-priority Pod waiting to be scheduled. The structure has three sub-queues:
- Active queue (activeQ): holds the Pods that are waiting to be scheduled.
- Unschedulable queue (unschedulableQ): when a Pod cannot satisfy the conditions required to be scheduled, it is added to this queue and waits for later retry attempts.
- Backoff queue (podBackOffQ): if a Pod keeps failing, the wait before the next attempt grows with the number of attempts, throttling retries so that repeated failures do not waste scheduling resources. Pods that fail scheduling are stored in this backoff queue to wait for retry; it is effectively a retry queue whose waiting time grows with each subsequent attempt.
Active queue
The **active queue (activeQ)** stores all Pods in the system that are currently waiting to be scheduled. In the priority queue construction above, we can see that activeQ is initialized by calling the heap.NewWithRecorder() function.
// pkg/scheduler/internal/heap/heap.go
// NewWithRecorder wraps an optional metricRecorder to compose a Heap object.
// In other words, it is a Heap plus metrics recording.
func NewWithRecorder(keyFn KeyFunc, lessFn lessFunc, metricRecorder metrics.MetricRecorder) *Heap {
    return &Heap{
        data: &data{
            items:    map[string]*heapItem{},
            queue:    []string{},
            keyFunc:  keyFn,
            lessFunc: lessFn,
        },
        metricRecorder: metricRecorder,
    }
}

// lessFunc is a function that receives two items and returns true if the first
// item should be placed before the second one when the list is sorted.
type lessFunc = func(item1, item2 interface{}) bool
The data structure inside is a standard Go heap (it only needs to implement the heap.Interface interface), and Heap wraps data with a metrics recorder on top. The most important piece here is the lessFunc used to compare element priorities: when the priority queue was initialized, we passed in the comp parameter, which becomes the lessFunc of the activeQ heap:
// pkg/scheduler/internal/queue/scheduling_queue.go
comp := func(podInfo1, podInfo2 interface{}) bool {
    pInfo1 := podInfo1.(*framework.QueuedPodInfo)
    pInfo2 := podInfo2.(*framework.QueuedPodInfo)
    return lessFn(pInfo1, pInfo2)
}
This ultimately calls the lessFn parameter passed in when the Scheduler object was created:
lessFn := profiles[c.profiles[0].SchedulerName].Framework.QueueSortFunc()
So element priorities are compared through the scheduling framework's QueueSortFunc() function, whose implementation is shown below:
// pkg/scheduler/framework/runtime/framework.go
// QueueSortFunc returns the function to sort pods in scheduling queue
func (f *frameworkImpl) QueueSortFunc() framework.LessFunc {
    if f == nil {
        // If frameworkImpl is nil, simply keep their order unchanged.
        // NOTE: this is primarily for tests.
        return func(_, _ *framework.QueuedPodInfo) bool { return false }
    }
    // If no QueueSort plugin is registered, this is a programming error.
    if len(f.queueSortPlugins) == 0 {
        panic("No QueueSort plugin is registered in the frameworkImpl.")
    }
    // Only one QueueSort plugin can be enabled.
    return f.queueSortPlugins[0].Less
}
So the function that actually compares element priorities in the queue comes from the QueueSort plugin. The QueueSort plugin enabled by default is PrioritySort, and the core of PrioritySort is its Less function:
// pkg/scheduler/framework/plugins/queuesort/priority_sort.go
// Less is the function used by the activeQ heap algorithm to sort pods.
// It sorts pods based on their priority. When priorities are equal, it uses
// PodQueueInfo.timestamp.
func (pl *PrioritySort) Less(pInfo1, pInfo2 *framework.QueuedPodInfo) bool {
    p1 := corev1helpers.PodPriority(pInfo1.Pod)
    p2 := corev1helpers.PodPriority(pInfo2.Pod)
    return (p1 > p2) || (p1 == p2 && pInfo1.Timestamp.Before(pInfo2.Timestamp))
}

// k8s.io/component-helpers/scheduling/corev1/helpers.go
// PodPriority returns priority of the given pod.
func PodPriority(pod *v1.Pod) int32 {
    if pod.Spec.Priority != nil {
        return *pod.Spec.Priority
    }
    // When priority of a running pod is nil, it means it was created at a time
    // that there was no global default priority class and the priority class
    // name of the pod was empty. So, we resolve to the static default priority.
    return 0
}
Now the picture is clear: Pods in the activeQ are ordered by the PrioritySort plugin. Every Pod carries a priority value when it is created (it can also be assigned through a cluster-wide PriorityClass object). During scheduling, Pods are first compared by priority; if the priorities are equal, they are compared by creation (enqueue) time. The higher the priority, the earlier the Pod is scheduled, and among equal priorities the earlier-created Pod wins.
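Below is a small standalone sketch of that comparison rule (using simplified stand-in types rather than the real framework.QueuedPodInfo), showing that a higher priority always wins and that the earlier timestamp only breaks ties:

package main

import (
    "fmt"
    "time"
)

// queuedPod is a simplified stand-in for framework.QueuedPodInfo.
type queuedPod struct {
    name      string
    priority  int32
    timestamp time.Time
}

// less mirrors PrioritySort.Less: higher priority first, earlier timestamp breaks ties.
func less(p1, p2 queuedPod) bool {
    return p1.priority > p2.priority ||
        (p1.priority == p2.priority && p1.timestamp.Before(p2.timestamp))
}

func main() {
    now := time.Now()
    older := queuedPod{name: "older-low-prio", priority: 0, timestamp: now.Add(-time.Hour)}
    newer := queuedPod{name: "newer-high-prio", priority: 1000, timestamp: now}
    fmt.Println(less(newer, older)) // true: priority beats enqueue time

    sameA := queuedPod{name: "a", priority: 100, timestamp: now.Add(-time.Minute)}
    sameB := queuedPod{name: "b", priority: 100, timestamp: now}
    fmt.Println(less(sameA, sameB)) // true: equal priority, earlier timestamp wins
}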
So when does a Pod get added to the activeQ? Remember the addAllEventHandlers function from when the Scheduler object was created? That is where the event handlers for unscheduled Pods are registered.
// pkg/scheduler/eventhandlers.go
func addAllEventHandlers(
    sched *Scheduler,
    informerFactory informers.SharedInformerFactory,
    dynInformerFactory dynamicinformer.DynamicSharedInformerFactory,
    gvkMap map[framework.GVK]framework.ActionType,
) {
    // scheduled pod cache
    ...
    // unscheduled pod queue
    informerFactory.Core().V1().Pods().Informer().AddEventHandler(
        cache.FilteringResourceEventHandler{
            FilterFunc: func(obj interface{}) bool {
                switch t := obj.(type) {
                case *v1.Pod:
                    return !assignedPod(t) && responsibleForPod(t, sched.Profiles)
                case cache.DeletedFinalStateUnknown:
                    if pod, ok := t.Obj.(*v1.Pod); ok {
                        // The carried object may be stale, so we don't use it to check if
                        // it's assigned or not.
                        return responsibleForPod(pod, sched.Profiles)
                    }
                    utilruntime.HandleError(fmt.Errorf("unable to convert object %T to *v1.Pod in %T", obj, sched))
                    return false
                default:
                    utilruntime.HandleError(fmt.Errorf("unable to handle object in %T: %T", sched, obj))
                    return false
                }
            },
            Handler: cache.ResourceEventHandlerFuncs{
                AddFunc:    sched.addPodToSchedulingQueue,
                UpdateFunc: sched.updatePodInSchedulingQueue,
                DeleteFunc: sched.deletePodFromSchedulingQueue,
            },
        },
    )
    ...
}
When a Pod event occurs, it first passes through FilterFunc: only Pods that are not bound to a node (not yet scheduled) and that use one of this scheduler's profiles reach the handlers below. For example, when a Pod is created an add event fires, which calls sched.addPodToSchedulingQueue:
// pkg/scheduler/eventhandlers.go
// Add an unscheduled pod to the priority queue.
func (sched *Scheduler) addPodToSchedulingQueue(obj interface{}) {
    pod := obj.(*v1.Pod)
    klog.V(3).InfoS("Add event for unscheduled pod", "pod", klog.KObj(pod))
    if err := sched.SchedulingQueue.Add(pod); err != nil {
        utilruntime.HandleError(fmt.Errorf("unable to queue %T: %v", obj, err))
    }
}
So when a Pod is created, it is added to the priority queue through the scheduling queue's Add function:
// pkg/scheduler/internal/queue/scheduling_queue.go
// Add adds a pod to the active queue. It should be called only when a new pod
// is added so there is no chance the pod is already in active/unschedulable/backoff queues
func (p *PriorityQueue) Add(pod *v1.Pod) error {
    p.lock.Lock()
    defer p.lock.Unlock()
    pInfo := p.newQueuedPodInfo(pod)
    // Add the pod to the activeQ.
    if err := p.activeQ.Add(pInfo); err != nil {
        klog.ErrorS(err, "Error adding pod to the active queue", "pod", klog.KObj(pod))
        return err
    }
    // If the pod is in the unschedulable pods map, remove it from there.
    if p.unschedulablePods.get(pod) != nil {
        klog.ErrorS(nil, "Error: pod is already in the unschedulable queue", "pod", klog.KObj(pod))
        p.unschedulablePods.delete(pod)
    }
    // Delete pod from backoffQ if it is backing off
    if err := p.podBackoffQ.Delete(pInfo); err == nil {
        klog.ErrorS(nil, "Error: pod is already in the podBackoff queue", "pod", klog.KObj(pod))
    }
    // Record metrics.
    metrics.SchedulerQueueIncomingPods.WithLabelValues("active", PodAdd).Inc()
    p.PodNominator.AddNominatedPod(pInfo.PodInfo, nil)
    // Notify anyone waiting on the queue.
    p.cond.Broadcast()
    return nil
}
Scheduling a Pod
Once the newly created Pod is in the activeQ, another goroutine can pop the top of the heap and perform the actual scheduling. This brings us back to the scheduling loop started at the beginning of this article, sched.scheduleOne. The basic flow for scheduling a single Pod is:
- Pop the Pod to schedule from the priority queue
- Skip scheduling in certain cases
- Run the schedule function to do the real scheduling and find a suitable node for the Pod
- If scheduling fails, try the preemption mechanism
- If scheduling succeeds, "assume" the Pod onto the selected node (a temporary binding stored in the scheduler cache) without waiting for the real binding to happen, so the scheduler can move on to the next Pod
- Asynchronously perform the real binding, i.e. set the node name in the Pod's spec.nodeName field
The scheduleOne function is shown below:
// pkg/scheduler/schedule_one.go
// scheduleOne does the entire scheduling workflow for a single pod. It is serialized on the scheduling algorithm's host fitting.
func (sched *Scheduler) scheduleOne(ctx context.Context) {
    // Get the next pod to schedule from the scheduler.
    podInfo := sched.NextPod()
    ...
}
scheduleOne first calls sched.NextPod() to get the Pod that should be scheduled now, which is exactly the element popped from the activeQ. The NextPod function was set when the Scheduler object was instantiated: internalqueue.MakeNextPodFunc(podQueue):
// pkg/scheduler/internal/queue/scheduling_queue.go
// MakeNextPodFunc returns a function to retrieve the next pod from a given
// scheduling queue
func MakeNextPodFunc(queue SchedulingQueue) func() *framework.QueuedPodInfo {
    return func() *framework.QueuedPodInfo {
        podInfo, err := queue.Pop()
        if err == nil {
            klog.V(4).InfoS("About to try and schedule pod", "pod", klog.KObj(podInfo.Pod))
            for plugin := range podInfo.UnschedulablePlugins {
                metrics.UnschedulableReason(plugin, podInfo.Pod.Spec.SchedulerName).Dec()
            }
            return podInfo
        }
        klog.ErrorS(err, "Error while retrieving next pod from scheduling queue")
        return nil
    }
}
This simply calls the priority queue's Pop() function to pop the next Pod to schedule.
// pkg/scheduler/internal/queue/scheduling_queue.go
// Pop removes the head of the active queue and returns it. It blocks if the
// activeQ is empty and waits until a new item is added to the queue. It
// increments scheduling cycle when a pod is popped.
func (p *PriorityQueue) Pop() (*framework.QueuedPodInfo, error) {
    p.lock.Lock()
    defer p.lock.Unlock()
    for p.activeQ.Len() == 0 {
        // When the queue is empty, invocation of Pop() is blocked until new item is enqueued.
        // When Close() is called, the p.closed is set and the condition is broadcast,
        // which causes this loop to continue and return from the Pop().
        if p.closed {
            return nil, fmt.Errorf(queueClosed)
        }
        p.cond.Wait()
    }
    obj, err := p.activeQ.Pop()
    if err != nil {
        return nil, err
    }
    pInfo := obj.(*framework.QueuedPodInfo)
    pInfo.Attempts++
    // Increment the scheduling cycle number.
    p.schedulingCycle++
    return pInfo, nil
}
Pop() simply pops the top of the activeQ heap and returns it. With the Pod to schedule in hand, the next step is to run the actual scheduling logic.
Next, the profile for the scheduler named by the Pod is looked up; it contains the scheduling framework object used for the rest of the cycle:
// pkg/scheduler/schedule_one.go
// scheduleOne does the entire scheduling workflow for a single pod. It is serialized on the scheduling algorithm's host fitting.
func (sched *Scheduler) scheduleOne(ctx context.Context) {
    // Get the next pod to schedule from the scheduler.
    podInfo := sched.NextPod()
    // pod could be nil when schedulerQueue is closed
    if podInfo == nil || podInfo.Pod == nil {
        return
    }
    pod := podInfo.Pod
    // Get the framework (profile) matching the pod's scheduler name.
    fwk, err := sched.frameworkForPod(pod)
    if err != nil {
        // This shouldn't happen, because we only accept for scheduling the pods
        // which specify a scheduler name that matches one of the profiles.
        klog.ErrorS(err, "Error occurred")
        return
    }
    // Skip scheduling the pod in certain cases.
    if sched.skipPodSchedule(fwk, pod) {
        return
    }
    ...
}

// skipPodSchedule returns true if we could skip scheduling the pod for specified cases.
func (sched *Scheduler) skipPodSchedule(fwk framework.Framework, pod *v1.Pod) bool {
    // Case 1: pod is being deleted.
    if pod.DeletionTimestamp != nil {
        fwk.EventRecorder().Eventf(pod, nil, v1.EventTypeWarning, "FailedScheduling", "Scheduling", "skip schedule deleting pod: %v/%v", pod.Namespace, pod.Name)
        klog.V(3).InfoS("Skip schedule deleting pod", "pod", klog.KObj(pod))
        return true
    }
    // Case 2: pod that has been assumed could be skipped.
    // An assumed pod can be added again to the scheduling queue if it got an update event
    // during its previous scheduling cycle but before getting assumed.
    isAssumed, err := sched.Cache.IsAssumedPod(pod)
    if err != nil {
        utilruntime.HandleError(fmt.Errorf("failed to check whether pod %s/%s is assumed: %v", pod.Namespace, pod.Name, err))
        return false
    }
    return isAssumed
}
Before the real scheduling starts, some Pods are skipped: for example Pods that are marked for deletion, and Pods that have already been assumed, which is determined by the IsAssumedPod function:
// pkg/scheduler/internal/cache/cache.go
func (cache *cacheImpl) IsAssumedPod(pod *v1.Pod) (bool, error) {
    key, err := framework.GetPodKey(pod)
    if err != nil {
        return false, err
    }
    cache.mu.RLock()
    defer cache.mu.RUnlock()
    return cache.assumedPods.Has(key), nil
}
After those skip checks, the scheduler synchronously tries to find a suitable node for the Pod. The whole process matches the scheduling framework flow described earlier:
// pkg/scheduler/schedule_one.go
func (sched *Scheduler) scheduleOne(ctx context.Context) {
    ...
    // Synchronously attempt to find a fit for the pod.
    start := time.Now()
    state := framework.NewCycleState()
    state.SetRecordPluginMetrics(rand.Intn(100) < pluginMetricsSamplePercent)
    // Initialize an empty podsToActivate struct, which will be filled up by plugins or stay empty.
    podsToActivate := framework.NewPodsToActivate()
    state.Write(framework.PodsToActivateKey, podsToActivate)

    schedulingCycleCtx, cancel := context.WithCancel(ctx)
    defer cancel()
    // This is where the actual scheduling happens.
    scheduleResult, err := sched.SchedulePod(schedulingCycleCtx, fwk, state, pod)
    if err != nil {
        // SchedulePod() may have failed because the pod would not fit on any host, so we try to
        // preempt, with the expectation that the next time the pod is tried for scheduling it
        // will fit due to the preemption. It is also possible that a different pod will schedule
        // into the resources that were preempted, but this is harmless.
        var nominatingInfo *framework.NominatingInfo
        if fitError, ok := err.(*framework.FitError); ok {
            if !fwk.HasPostFilterPlugins() {
                klog.V(3).InfoS("No PostFilter plugins are registered, so no preemption will be performed")
            } else {
                // Run PostFilter plugins to try to make the pod schedulable in a future scheduling cycle.
                result, status := fwk.RunPostFilterPlugins(ctx, state, pod, fitError.Diagnosis.NodeToStatusMap)
                if status.Code() == framework.Error {
                    klog.ErrorS(nil, "Status after running PostFilter plugins for pod", "pod", klog.KObj(pod), "status", status)
                } else {
                    fitError.Diagnosis.PostFilterMsg = status.Message()
                    klog.V(5).InfoS("Status after running PostFilter plugins for pod", "pod", klog.KObj(pod), "status", status)
                }
                if result != nil {
                    nominatingInfo = result.NominatingInfo
                }
            }
            // Pod did not fit anywhere, so it is counted as a failure. If preemption
            // succeeds, the pod should get counted as a success the next time we try to
            // schedule it. (hopefully)
            metrics.PodUnschedulable(fwk.ProfileName(), metrics.SinceInSeconds(start))
        } else if err == ErrNoNodesAvailable {
            nominatingInfo = clearNominatedNode
            // No nodes available is counted as unschedulable rather than an error.
            metrics.PodUnschedulable(fwk.ProfileName(), metrics.SinceInSeconds(start))
        } else {
            nominatingInfo = clearNominatedNode
            klog.ErrorS(err, "Error selecting node for pod", "pod", klog.KObj(pod))
            metrics.PodScheduleError(fwk.ProfileName(), metrics.SinceInSeconds(start))
        }
        // handleSchedulingFailure records a FailedScheduling event for the pod and updates its
        // condition and nominated node name if set; most importantly, it puts the failed pod
        // back into the unschedulable/backoff queues.
        sched.handleSchedulingFailure(fwk, podInfo, err, v1.PodReasonUnschedulable, nominatingInfo)
        return
    }
    metrics.SchedulingAlgorithmLatency.Observe(metrics.SinceInSeconds(start))
    // Tell the cache to assume that a pod now is running on a given node, even though it hasn't been bound yet.
    // This allows us to keep scheduling without waiting on binding to occur.
    assumedPodInfo := podInfo.DeepCopy() // copy the info of the pod being scheduled
    assumedPod := assumedPodInfo.Pod
    // assume modifies `assumedPod` by setting NodeName=scheduleResult.SuggestedHost
    err = sched.assume(assumedPod, scheduleResult.SuggestedHost)
    if err != nil {
        metrics.PodScheduleError(fwk.ProfileName(), metrics.SinceInSeconds(start))
        // This is most probably result of a BUG in retrying logic.
        // We report an error here so that pod scheduling can be retried.
        // This relies on the fact that Error will check if the pod has been bound
        // to a node and if so will not add it back to the unscheduled pods queue
        // (otherwise this would cause an infinite loop).
        sched.handleSchedulingFailure(fwk, assumedPodInfo, err, SchedulerError, clearNominatedNode)
        return
    }
    // Run the Reserve method of reserve plugins.
    if sts := fwk.RunReservePluginsReserve(schedulingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost); !sts.IsSuccess() {
        metrics.PodScheduleError(fwk.ProfileName(), metrics.SinceInSeconds(start))
        // trigger un-reserve to clean up state associated with the reserved Pod
        fwk.RunReservePluginsUnreserve(schedulingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
        // Remove the assumed (temporarily bound) pod from the cache.
        if forgetErr := sched.Cache.ForgetPod(assumedPod); forgetErr != nil {
            klog.ErrorS(forgetErr, "Scheduler cache ForgetPod failed")
        }
        sched.handleSchedulingFailure(fwk, assumedPodInfo, sts.AsError(), SchedulerError, clearNominatedNode)
        return
    }
    // Run "permit" plugins.
    runPermitStatus := fwk.RunPermitPlugins(schedulingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
    if runPermitStatus.Code() != framework.Wait && !runPermitStatus.IsSuccess() {
        // The permit result is neither Wait nor Success, so handle the failure.
        var reason string
        if runPermitStatus.IsUnschedulable() {
            metrics.PodUnschedulable(fwk.ProfileName(), metrics.SinceInSeconds(start))
            reason = v1.PodReasonUnschedulable
        } else {
            metrics.PodScheduleError(fwk.ProfileName(), metrics.SinceInSeconds(start))
            reason = SchedulerError
        }
        // One of the plugins returned status different than success or wait.
        // Trigger the un-reserve plugins.
        fwk.RunReservePluginsUnreserve(schedulingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
        // Remove the assumed pod from the cache.
        if forgetErr := sched.Cache.ForgetPod(assumedPod); forgetErr != nil {
            klog.ErrorS(forgetErr, "Scheduler cache ForgetPod failed")
        }
        sched.handleSchedulingFailure(fwk, assumedPodInfo, runPermitStatus.AsError(), reason, clearNominatedNode)
        return
    }
    ...
}
As the code above shows, sched.SchedulePod performs the real scheduling and selects a suitable node. If scheduling fails, for example because no node fits, and PostFilter plugins are registered, they are run; the preemption logic lives in a PostFilter plugin. The failed Pod is then put back into the unschedulableQ or podBackoffQ queue.
If scheduling succeeds and a node has been selected, the Pod is first assumed onto that node (a temporary binding, without waiting for the real binding) and stored in the scheduler cache, and then the Reserve plugins run to reserve the resources the Pod needs on the node.
Right after that, the last stage of the scheduling cycle runs the Permit plugins, which can block or delay the binding of the Pod to the node.
Once all of the above has completed, the real binding to the node happens asynchronously.
// pkg/scheduler/schedule_one.go
func (sched *Scheduler) scheduleOne(ctx context.Context) {
    ...
    // bind the pod to its host asynchronously (we can do this b/c of the assumption step above).
    go func() {
        bindingCycleCtx, cancel := context.WithCancel(ctx)
        defer cancel()
        metrics.SchedulerGoroutines.WithLabelValues(metrics.Binding).Inc()
        defer metrics.SchedulerGoroutines.WithLabelValues(metrics.Binding).Dec()

        // First call WaitOnPermit, which works together with the Permit extension point
        // above to implement delayed binding.
        waitOnPermitStatus := fwk.WaitOnPermit(bindingCycleCtx, assumedPod)
        if !waitOnPermitStatus.IsSuccess() {
            var reason string
            if waitOnPermitStatus.IsUnschedulable() {
                metrics.PodUnschedulable(fwk.ProfileName(), metrics.SinceInSeconds(start))
                reason = v1.PodReasonUnschedulable
            } else {
                metrics.PodScheduleError(fwk.ProfileName(), metrics.SinceInSeconds(start))
                reason = SchedulerError
            }
            // trigger un-reserve plugins to clean up state associated with the reserved Pod
            fwk.RunReservePluginsUnreserve(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
            if forgetErr := sched.Cache.ForgetPod(assumedPod); forgetErr != nil {
                klog.ErrorS(forgetErr, "scheduler cache ForgetPod failed")
            } else {
                // "Forget"ing an assumed Pod in binding cycle should be treated as a PodDelete event,
                // as the assumed Pod had occupied a certain amount of resources in scheduler cache.
                // TODO(#103853): de-duplicate the logic.
                // Avoid moving the assumed Pod itself as it's always Unschedulable.
                // It's intentional to "defer" this operation; otherwise MoveAllToActiveOrBackoffQueue() would
                // update `q.moveRequest` and thus move the assumed pod to backoffQ anyways.
                defer sched.SchedulingQueue.MoveAllToActiveOrBackoffQueue(internalqueue.AssignedPodDelete, func(pod *v1.Pod) bool {
                    return assumedPod.UID != pod.UID
                })
            }
            sched.handleSchedulingFailure(fwk, assumedPodInfo, waitOnPermitStatus.AsError(), reason, clearNominatedNode)
            return
        }

        // Run "prebind" plugins.
        preBindStatus := fwk.RunPreBindPlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
        if !preBindStatus.IsSuccess() {
            metrics.PodScheduleError(fwk.ProfileName(), metrics.SinceInSeconds(start))
            // trigger un-reserve plugins to clean up state associated with the reserved Pod
            fwk.RunReservePluginsUnreserve(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
            if forgetErr := sched.Cache.ForgetPod(assumedPod); forgetErr != nil {
                klog.ErrorS(forgetErr, "scheduler cache ForgetPod failed")
            } else {
                // "Forget"ing an assumed Pod in binding cycle should be treated as a PodDelete event,
                // as the assumed Pod had occupied a certain amount of resources in scheduler cache.
                // TODO(#103853): de-duplicate the logic.
                sched.SchedulingQueue.MoveAllToActiveOrBackoffQueue(internalqueue.AssignedPodDelete, nil)
            }
            sched.handleSchedulingFailure(fwk, assumedPodInfo, preBindStatus.AsError(), SchedulerError, clearNominatedNode)
            return
        }

        // Call the bind function to perform the real binding.
        err := sched.bind(bindingCycleCtx, fwk, assumedPod, scheduleResult.SuggestedHost, state)
        if err != nil {
            metrics.PodScheduleError(fwk.ProfileName(), metrics.SinceInSeconds(start))
            // trigger un-reserve plugins to clean up state associated with the reserved Pod
            fwk.RunReservePluginsUnreserve(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
            if err := sched.Cache.ForgetPod(assumedPod); err != nil {
                klog.ErrorS(err, "scheduler cache ForgetPod failed")
            } else {
                // "Forget"ing an assumed Pod in binding cycle should be treated as a PodDelete event,
                // as the assumed Pod had occupied a certain amount of resources in scheduler cache.
                // TODO(#103853): de-duplicate the logic.
                sched.SchedulingQueue.MoveAllToActiveOrBackoffQueue(internalqueue.AssignedPodDelete, nil)
            }
            sched.handleSchedulingFailure(fwk, assumedPodInfo, fmt.Errorf("binding rejected: %w", err), SchedulerError, clearNominatedNode)
            return
        }
        // Calculating nodeResourceString can be heavy. Avoid it if klog verbosity is below 2.
        klog.V(2).InfoS("Successfully bound pod to node", "pod", klog.KObj(pod), "node", scheduleResult.SuggestedHost, "evaluatedNodes", scheduleResult.EvaluatedNodes, "feasibleNodes", scheduleResult.FeasibleNodes)
        metrics.PodScheduled(fwk.ProfileName(), metrics.SinceInSeconds(start))
        metrics.PodSchedulingAttempts.Observe(float64(podInfo.Attempts))
        metrics.PodSchedulingDuration.WithLabelValues(getAttemptsLabel(podInfo)).Observe(metrics.SinceInSeconds(podInfo.InitialAttemptTimestamp))

        // Run "postbind" plugins.
        fwk.RunPostBindPlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)

        // At the end of a successful binding cycle, move up Pods if needed.
        if len(podsToActivate.Map) != 0 {
            sched.SchedulingQueue.Activate(podsToActivate.Map)
            // Unlike the logic in scheduling cycle, we don't bother deleting the entries
            // as `podsToActivate.Map` is no longer consumed.
        }
    }()
}
The real binding is performed in a separate goroutine. Because the scheduling cycle may have ended with a Permit plugin returning Wait, the binding cycle first calls the built-in WaitOnPermit, which pairs with the Permit plugins: if the Pod is in the waiting state, WaitOnPermit blocks until the Pod is approved or denied.
After that, the flow matches the scheduling framework: the prebind, bind, and postbind plugins run in turn to complete the real binding.
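As an illustration of what can trigger that wait, here is a minimal sketch of a custom Permit plugin; the plugin name and the label-based condition are my own assumptions, not something taken from the scheduler source. Returning framework.Wait here is what later makes fwk.WaitOnPermit block in the binding goroutine until the pod is allowed or rejected:

package examplegate

import (
    "context"
    "time"

    v1 "k8s.io/api/core/v1"
    "k8s.io/kubernetes/pkg/scheduler/framework"
)

// Name is the name the plugin would be registered under (hypothetical).
const Name = "ExampleGate"

type ExampleGate struct{}

var _ framework.PermitPlugin = &ExampleGate{}

func (g *ExampleGate) Name() string { return Name }

// Permit runs at the end of the scheduling cycle. Returning Wait parks the pod
// in the framework's waiting-pods map for up to the returned timeout; another
// component can then approve it via framework.Handle's GetWaitingPod(uid).Allow(Name),
// which is what unblocks WaitOnPermit in the binding cycle.
func (g *ExampleGate) Permit(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (*framework.Status, time.Duration) {
    if pod.Labels["example.io/gated"] == "true" {
        return framework.NewStatus(framework.Wait, "waiting for external approval"), 30 * time.Second
    }
    return framework.NewStatus(framework.Success, ""), 0
}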
Failure and retry handling
podBackOffQ
When a Pod fails to schedule, you will often see "backoff" mentioned in its events. What does backoff mean?
Backoff is a very common mechanism in concurrent programming: if a task keeps failing, the waiting time before the next attempt grows with the number of attempts, throttling retries so that repeated failures do not waste resources.
Pods that fail scheduling are put into the backoff queue first and retried later. podBackOffQ mainly holds Pods that keep failing across multiple scheduling cycles; the backoff mechanism delays when they are attempted again.
backoffQ is also a priority queue; it is initialized when the Scheduler initializes the priority queue, and the key piece is the function that compares the priority of its elements, pq.podsCompareBackoffCompleted:
// pkg/scheduler/internal/queue/scheduling_queue.go
// (see NewPriorityQueue above; the relevant part is the backoff queue setup)
pq.podBackoffQ = heap.NewWithRecorder(podInfoKeyFunc, pq.podsCompareBackoffCompleted, metrics.NewBackoffPodsRecorder())
// podsCompareBackoffCompleted compares elements in the backoff queue: the pod whose
// backoff finishes earlier has the higher priority.
func (p *PriorityQueue) podsCompareBackoffCompleted(podInfo1, podInfo2 interface{}) bool {
    pInfo1 := podInfo1.(*framework.QueuedPodInfo)
    pInfo2 := podInfo2.(*framework.QueuedPodInfo)
    bo1 := p.getBackoffTime(pInfo1)
    bo2 := p.getBackoffTime(pInfo2)
    return bo1.Before(bo2)
}
// getBackoffTime returns the time that podInfo completes backoff
func (p *PriorityQueue) getBackoffTime(podInfo *framework.QueuedPodInfo) time.Time {
    duration := p.calculateBackoffDuration(podInfo)
    backoffTime := podInfo.Timestamp.Add(duration)
    return backoffTime
}

// calculateBackoffDuration is a helper function for calculating the backoffDuration
// based on the number of attempts the pod has made.
func (p *PriorityQueue) calculateBackoffDuration(podInfo *framework.QueuedPodInfo) time.Duration {
    duration := p.podInitialBackoffDuration
    for i := 1; i < podInfo.Attempts; i++ {
        // Use subtraction instead of addition or multiplication to avoid overflow.
        if duration > p.podMaxBackoffDuration-duration {
            return p.podMaxBackoffDuration
        }
        duration += duration
    }
    return duration
}
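The backoff duration therefore doubles with every failed attempt until it reaches the configured maximum. The standalone sketch below replays that doubling logic with a 1s initial and 10s maximum backoff (these values are assumed here; they match the default kube-scheduler configuration of podInitialBackoffSeconds=1 and podMaxBackoffSeconds=10):

package main

import (
    "fmt"
    "time"
)

// backoff mirrors calculateBackoffDuration above: double the wait on every
// attempt, capping it at max (the subtraction guards against overflow).
func backoff(attempts int, initial, max time.Duration) time.Duration {
    duration := initial
    for i := 1; i < attempts; i++ {
        if duration > max-duration {
            return max
        }
        duration += duration
    }
    return duration
}

func main() {
    for attempts := 1; attempts <= 6; attempts++ {
        fmt.Printf("attempt %d -> wait %v\n", attempts, backoff(attempts, time.Second, 10*time.Second))
    }
    // attempt 1 -> 1s, 2 -> 2s, 3 -> 4s, 4 -> 8s, 5 -> 10s, 6 -> 10s
}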
Unschedulable queue
unschedulableQ is, as the name says, the queue of Pods that have been determined to be unschedulable. Although we call it a queue, the underlying data structure is actually a map:
// pkg/scheduler/internal/queue/scheduling_queue.go
// UnschedulablePods holds pods that cannot be scheduled. This data structure
// is used to implement unschedulablePods.
type UnschedulablePods struct {
    // podInfoMap is a map key by a pod's full-name (podname_namespace) and the value is
    // a pointer to the QueuedPodInfo.
    podInfoMap map[string]*framework.QueuedPodInfo
    keyFunc    func(*v1.Pod) string
    // metricRecorder updates the counter when elements of an unschedulablePodsMap
    // get added or removed, and it does nothing if it's nil
    metricRecorder metrics.MetricRecorder
}
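For reference, here is a tiny sketch of that map key format; it assumes keyFunc behaves like the scheduler's pod "full name" helper (name followed by namespace), which is stated as an assumption rather than quoted from the source:

package main

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// podFullName mimics the assumed key format: "<pod name>_<namespace>".
func podFullName(pod *v1.Pod) string {
    return pod.Name + "_" + pod.Namespace
}

func main() {
    pod := &v1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "nginx-abcde", Namespace: "default"}}
    fmt.Println(podFullName(pod)) // nginx-abcde_default
}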
Error handling
When the actual scheduling in scheduleOne fails, handleSchedulingFailure is called to record the failure. The part we care about most is its call to the Scheduler's Error callback, which was passed in when the scheduler was initialized and is built by MakeDefaultErrorFunc. Inside that callback, the Pod that just failed scheduling is added to the unschedulablePods map or to the podBackoffQ queue:
// pkg/scheduler/schedule_one.go
// handleSchedulingFailure records an event for the pod that indicates the
// pod has failed to schedule. Also, update the pod condition and nominated node name if set.
func (sched *Scheduler) handleSchedulingFailure(fwk framework.Framework, podInfo *framework.QueuedPodInfo, err error, reason string, nominatingInfo *framework.NominatingInfo) {
    // The Error callback (see MakeDefaultErrorFunc below) re-queues the failed pod.
    sched.Error(podInfo, err)

    // Update the scheduling queue with the nominated pod information. Without
    // this, there would be a race condition between the next scheduling cycle
    // and the time the scheduler receives a Pod Update for the nominated pod.
    // Here we check for nil only for tests.
    if sched.SchedulingQueue != nil {
        sched.SchedulingQueue.AddNominatedPod(podInfo.PodInfo, nominatingInfo)
    }

    pod := podInfo.Pod
    msg := truncateMessage(err.Error())
    fwk.EventRecorder().Eventf(pod, nil, v1.EventTypeWarning, "FailedScheduling", "Scheduling", msg)
    if err := updatePod(sched.client, pod, &v1.PodCondition{
        Type:    v1.PodScheduled,
        Status:  v1.ConditionFalse,
        Reason:  reason,
        Message: err.Error(),
    }, nominatingInfo); err != nil {
        klog.ErrorS(err, "Error updating pod", "pod", klog.KObj(pod))
    }
}
// pkg/scheduler/scheduler.go
// MakeDefaultErrorFunc construct a function to handle pod scheduler error
func MakeDefaultErrorFunc(client clientset.Interface, podLister corelisters.PodLister, podQueue internalqueue.SchedulingQueue, schedulerCache internalcache.Cache) func(*framework.QueuedPodInfo, error) {
    return func(podInfo *framework.QueuedPodInfo, err error) {
        pod := podInfo.Pod
        ...
        // Check if the Pod exists in informer cache.
        cachedPod, err := podLister.Pods(pod.Namespace).Get(pod.Name)
        if err != nil {
            klog.InfoS("Pod doesn't exist in informer cache", "pod", klog.KObj(pod), "err", err)
            return
        }
        ...
        // As <cachedPod> is from SharedInformer, we need to do a DeepCopy() here.
        podInfo.PodInfo = framework.NewPodInfo(cachedPod.DeepCopy())
        if err := podQueue.AddUnschedulableIfNotPresent(podInfo, podQueue.SchedulingCycle()); err != nil {
            klog.ErrorS(err, "Error occurred")
        }
    }
}
The Pod is actually re-queued by calling podQueue.AddUnschedulableIfNotPresent:
// pkg/scheduler/internal/queue/scheduling_queue.go
// AddUnschedulableIfNotPresent inserts a pod that cannot be scheduled into
// the queue, unless it is already in the queue. Normally, PriorityQueue puts
// unschedulable pods in `unschedulablePods`. But if there has been a recent move
// request, then the pod is put in `podBackoffQ`.
func (p *PriorityQueue) AddUnschedulableIfNotPresent(pInfo *framework.QueuedPodInfo, podSchedulingCycle int64) error {
    p.lock.Lock()
    defer p.lock.Unlock()
    pod := pInfo.Pod
    // Check whether the pod is already in unschedulablePods.
    if p.unschedulablePods.get(pod) != nil {
        return fmt.Errorf("Pod %v is already present in unschedulable queue", klog.KObj(pod))
    }
    // Check whether the pod is already in activeQ.
    if _, exists, _ := p.activeQ.Get(pInfo); exists {
        return fmt.Errorf("Pod %v is already present in the active queue", klog.KObj(pod))
    }
    // Check whether the pod is already in podBackoffQ.
    if _, exists, _ := p.podBackoffQ.Get(pInfo); exists {
        return fmt.Errorf("Pod %v is already present in the backoff queue", klog.KObj(pod))
    }

    // Refresh the timestamp since the pod is re-added.
    pInfo.Timestamp = p.clock.Now()

    // If a move request has been received, move it to the BackoffQ, otherwise move
    // it to unschedulablePods.
    for plugin := range pInfo.UnschedulablePlugins {
        metrics.UnschedulableReason(plugin, pInfo.Pod.Spec.SchedulerName).Inc()
    }
    if p.moveRequestCycle >= podSchedulingCycle {
        if err := p.podBackoffQ.Add(pInfo); err != nil {
            return fmt.Errorf("error adding pod %v to the backoff queue: %v", pod.Name, err)
        }
        metrics.SchedulerQueueIncomingPods.WithLabelValues("backoff", ScheduleAttemptFailure).Inc()
    } else {
        p.unschedulablePods.addOrUpdate(pInfo)
        metrics.SchedulerQueueIncomingPods.WithLabelValues("unschedulable", ScheduleAttemptFailure).Inc()
    }

    p.PodNominator.AddNominatedPod(pInfo.PodInfo, nil)
    return nil
}
When a Pod fails to schedule, AddUnschedulableIfNotPresent is called, and it decides between the two queues as follows:
- If moveRequestCycle is greater than or equal to the current podSchedulingCycle, the previously failed Pod should be retried now, so it is put into the backoffQ queue.
- Otherwise it goes into the unschedulableQ (unschedulable) queue.
moveRequestCycle only catches up with podSchedulingCycle when cluster resources have changed (the resource event handlers set moveRequestCycle = podSchedulingCycle). In theory, when a Pod fails to schedule and nothing else happens afterwards, it goes into the unschedulableQ. But it is possible that right after the Pod fails, and before the error handling runs, a resource change occurs; since the cluster state has changed in that window, the Pod now has a chance of being scheduled successfully, so it is put into the backoffQ instead, waiting for a quick retry.
So how does data flow between the three sub-queues of the PriorityQueue? Again, start from the function that launches scheduling:
// pkg/scheduler/scheduler.go
// Run begins watching and scheduling. It starts scheduling and blocked until the context is done.
func (sched *Scheduler) Run(ctx context.Context) {
    sched.SchedulingQueue.Run()
    wait.UntilWithContext(ctx, sched.scheduleOne, 0)
    sched.SchedulingQueue.Close()
}
sched.SchedulingQueue.Run() runs the PriorityQueue's Run() function:
// pkg/scheduler/internal/queue/scheduling_queue.go
// Run starts the goroutine to pump from podBackoffQ to activeQ
func (p *PriorityQueue) Run() {
    go wait.Until(p.flushBackoffQCompleted, 1.0*time.Second, p.stop)
    go wait.Until(p.flushUnschedulablePodsLeftover, 30*time.Second, p.stop)
}

// flushBackoffQCompleted Moves all pods from backoffQ which have completed backoff in to activeQ
func (p *PriorityQueue) flushBackoffQCompleted() {
    p.lock.Lock()
    defer p.lock.Unlock()
    broadcast := false
    for {
        // Peek at the head of the heap (without removing it).
        rawPodInfo := p.podBackoffQ.Peek()
        if rawPodInfo == nil {
            break
        }
        pod := rawPodInfo.(*framework.QueuedPodInfo).Pod
        // If this pod's backoff has not completed yet, stop here.
        boTime := p.getBackoffTime(rawPodInfo.(*framework.QueuedPodInfo))
        if boTime.After(p.clock.Now()) {
            break
        }
        // Backoff is complete, so pop the head of the heap.
        _, err := p.podBackoffQ.Pop()
        if err != nil {
            klog.ErrorS(err, "Unable to pop pod from backoff queue despite backoff completion", "pod", klog.KObj(pod))
            break
        }
        // Add it to the activeQ.
        p.activeQ.Add(rawPodInfo)
        metrics.SchedulerQueueIncomingPods.WithLabelValues("active", BackoffComplete).Inc()
        broadcast = true
    }
    // Notify anyone waiting on the queue.
    if broadcast {
        p.cond.Broadcast()
    }
}

// flushUnschedulablePodsLeftover moves pods which stay in unschedulablePods
// longer than podMaxInUnschedulablePodsDuration to backoffQ or activeQ.
func (p *PriorityQueue) flushUnschedulablePodsLeftover() {
    p.lock.Lock()
    defer p.lock.Unlock()

    var podsToMove []*framework.QueuedPodInfo
    currentTime := p.clock.Now()
    for _, pInfo := range p.unschedulablePods.podInfoMap {
        // The time of the last scheduling attempt.
        lastScheduleTime := pInfo.Timestamp
        // If the pod has stayed in the unschedulable queue longer than podMaxInUnschedulablePodsDuration...
        if currentTime.Sub(lastScheduleTime) > p.podMaxInUnschedulablePodsDuration {
            podsToMove = append(podsToMove, pInfo)
        }
    }

    if len(podsToMove) > 0 {
        // ...move it to the active queue or the backoff queue.
        p.movePodsToActiveOrBackoffQueue(podsToMove, UnschedulableTimeout)
    }
}

// NOTE: this function assumes lock has been acquired in caller
func (p *PriorityQueue) movePodsToActiveOrBackoffQueue(podInfoList []*framework.QueuedPodInfo, event framework.ClusterEvent) {
    moved := false
    for _, pInfo := range podInfoList {
        // If the event doesn't help making the Pod schedulable, continue.
        // Note: we don't run the check if pInfo.UnschedulablePlugins is nil, which denotes
        // either there is some abnormal error, or scheduling the pod failed by plugins other than PreFilter, Filter and Permit.
        // In that case, it's desired to move it anyways.
        if len(pInfo.UnschedulablePlugins) != 0 && !p.podMatchesEvent(pInfo, event) {
            continue
        }
        moved = true
        pod := pInfo.Pod
        // If the pod is still within its backoff window...
        if p.isPodBackingoff(pInfo) {
            // ...add it to the podBackoffQ...
            if err := p.podBackoffQ.Add(pInfo); err != nil {
                klog.ErrorS(err, "Error adding pod to the backoff queue", "pod", klog.KObj(pod))
            } else {
                // ...and remove it from unschedulablePods.
                metrics.SchedulerQueueIncomingPods.WithLabelValues("backoff", event.Label).Inc()
                p.unschedulablePods.delete(pod)
            }
        } else {
            // Otherwise add it to the activeQ...
            if err := p.activeQ.Add(pInfo); err != nil {
                klog.ErrorS(err, "Error adding pod to the scheduling queue", "pod", klog.KObj(pod))
            } else {
                // ...and remove it from unschedulablePods.
                metrics.SchedulerQueueIncomingPods.WithLabelValues("active", event.Label).Inc()
                p.unschedulablePods.delete(pod)
            }
        }
    }
    // Set moveRequestCycle to the current schedulingCycle.
    p.moveRequestCycle = p.schedulingCycle
    if moved {
        p.cond.Broadcast()
    }
}
Together with the sched.scheduleOne loop above, the overall workflow across the three sub-queues is:
- Every 1 second, check whether any Pod in backoffQ can be moved to activeQ
- Every 30 seconds, check whether any Pod in unschedulableQ can be moved to activeQ or backoffQ (by default, after it has waited there for more than 60 seconds)
- Keep calling scheduleOne, popping Pods from activeQ and scheduling them
If a Pod fails to schedule, it is normally considered unschedulable and goes into the unschedulableQ. If cluster resources never change, these Pods are still re-attempted roughly every 60 seconds.
But as soon as the resource state changes, those unschedulable Pods may become schedulable, so the Pods in unschedulableQ are moved into backoffQ to wait for rescheduling. The waiting time in backoffQ depends on the number of attempts: the fewer the attempts, the shorter the wait, so Pods in backoffQ get rescheduled much faster than Pods sitting in unschedulableQ.