Introduction
In a Kubernetes cluster, Pod scheduling is one of the core functions. The scheduler (kube-scheduler) is responsible for finding the most suitable node for each newly created Pod, ensuring that cluster resources are used efficiently. This article explores the mechanisms behind Pod scheduling, from basic fields to advanced scheduling policies, moving from simple to complex for a thorough look at the Kubernetes scheduling system.
nodeName: Specifying a Node
- nodeName: the most direct scheduling approach; it forces the Pod to run on the named node, bypassing the scheduler entirely.
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod
spec:
  nodeName: worker53 # specify the target node by name
  containers:
  - name: nginx
    image: nginx:1.23
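A quick way to confirm the placement (the file name nginx-pod.yaml is assumed here; use whatever name you saved the manifest above under):
#Apply and verify
kubectl apply -f nginx-pod.yaml
kubectl get pod nginx-pod -o wide # the NODE column should show worker53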
hostNetwork and hostPort: Node Network Configuration
hostNetwork example
- hostNetwork: instead of allocating a new network namespace for the Pod, the Pod shares the host's network namespace.
- The YAML example below illustrates the effect: because the master node carries a taint by default, Pods cannot be scheduled there and can only land on the two worker nodes. Since hostNetwork shares the host's network (and therefore its ports), each worker node can accept only one of these Pods. Once both workers are running one Pod each, the remaining Pods fail to schedule and stay in the Pending state.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scheduler-hostnetwork
spec:
  replicas: 5
  selector:
    matchLabels:
      apps: nginx
  template:
    metadata:
      labels:
        apps: nginx
    spec:
      hostNetwork: true
      containers:
      - image: nginx:1.23
        name: c1
        # ports the container exposes
        ports:
        # the container port
        - containerPort: 80
#Test and verify
[root@master51 ~]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
scheduler-hostnetwork-6967667b7d-n5pgn 1/1 Running 0 3s 10.0.0.53 worker53 <none> <none>
scheduler-hostnetwork-6967667b7d-sdnmd 0/1 Pending 0 3s <none> <none> <none> <none>
scheduler-hostnetwork-6967667b7d-sljk4 0/1 Pending 0 3s <none> <none> <none> <none>
scheduler-hostnetwork-6967667b7d-zkm9c 1/1 Running 0 3s 10.0.0.52 worker52 <none> <none>
scheduler-hostnetwork-6967667b7d-zqkl6 0/1 Pending 0 3s <none> <none> <none> <none>
[root@master51 ~]#
[root@master51 ~]# kubectl describe pod scheduler-hostnetwork-6967667b7d-zqkl6
Name: scheduler-hostnetwork-6967667b7d-zqkl6
Namespace: default
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 31s default-scheduler 0/3 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) didn't have free ports for the requested pod ports.
[root@master51]#
hostPort example
- hostPort: the worker node performs NAT forwarding for the Pod, mapping a container port to a host port; this is one way to expose a service outside the Kubernetes cluster.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scheduler-hostport
spec:
  replicas: 5
  selector:
    matchLabels:
      apps: nginx
  template:
    metadata:
      labels:
        apps: nginx
    spec:
      containers:
      - image: nginx:1.23
        name: c1
        ports:
        - containerPort: 80
          # map container port 80 to port 9090 on the host
          hostPort: 9090
#Test and verify
[root@master51 ~]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
scheduler-hostport-dbfd5554f-2tvk8 0/1 Pending 0 3s <none> <none> <none> <none>
scheduler-hostport-dbfd5554f-7wrc9 0/1 Pending 0 3s <none> <none> <none> <none>
scheduler-hostport-dbfd5554f-nnn8w 1/1 Running 0 3s 10.100.203.178 worker52 <none> <none>
scheduler-hostport-dbfd5554f-r5jd4 1/1 Running 0 3s 10.100.140.145 worker53 <none> <none>
scheduler-hostport-dbfd5554f-sgbsz 0/1 Pending 0 3s <none> <none> <none> <none>
[root@master51 ~]#
[root@master51 ~]# kubectl describe pod scheduler-hostport-dbfd5554f-sgbsz
Name: scheduler-hostport-dbfd5554f-sgbsz
Namespace: default
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 55s default-scheduler 0/3 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) didn't have free ports for the requested pod ports.
[root@master51 ~]#
resources: Resource Requests and Limits
- requests
  - The resources the container expects. If a node cannot satisfy the request, the Pod is not scheduled onto that node.
  - Once the Pod is scheduled onto a node, the requested resources are not immediately consumed in full.
  - requests is only an expectation; the container may actually consume more than the requested amount.
- limits
  - The upper bound on the container's resource usage; a runtime limit that prevents the container from over-consuming resources.
  - When CPU usage exceeds limits:
    - Throttled: CPU is a compressible resource, so the container is throttled.
    - Not terminated: the container is not killed; it simply runs slower.
    - Effect: CPU usage is capped at the configured limit, and performance may degrade.
  - When memory usage exceeds limits:
    - The OOM killer steps in: memory is an incompressible resource, so the Linux kernel triggers OOM (Out Of Memory).
    - The container is killed: the kernel picks a process to terminate based on its OOM score, usually a process inside the container.
    - The restart policy applies: the container is restarted according to the Pod's restartPolicy (see the verification commands after the example manifest below).
#Resource manifest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scheduler-resources
spec:
  replicas: 5
  selector:
    matchLabels:
      apps: nginx
  template:
    metadata:
      labels:
        apps: nginx
    spec:
      containers:
      - image: nginx:1.23
        name: c1
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 500m
            memory: 200Mi
          limits:
            cpu: 1.5
            memory: 2Gi
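To observe the behaviors described above at runtime, the following commands can help. This is a minimal sketch assuming metrics-server is installed for kubectl top, and <pod-name> is a placeholder for one of the Pods created by the Deployment:
# Compare live CPU/memory usage against the configured limits (requires metrics-server)
kubectl top pod -l apps=nginx
# If a container was killed for exceeding its memory limit, its last state reports OOMKilled
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'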
Taints: Node Taints
- A taint allows a node to repel certain Pods, which affects Pod scheduling.
- A taint has one of three effects (the effect field):
  - PreferNoSchedule: the scheduler tries to place Pods on other nodes first, and only falls back to this node when no other node can satisfy the scheduling request.
  - NoSchedule: the node no longer accepts new Pods, but Pods already scheduled onto it are not evicted.
  - NoExecute: the node no longer accepts new Pods, and Pods already scheduled onto it are evicted.
# Add taints
kubectl taint nodes node1 key1=value1:NoSchedule
kubectl taint nodes node1 key2=value2:NoExecute
kubectl taint nodes node1 key3=value3:PreferNoSchedule
# View taints
kubectl describe node node1 | grep Taint
kubectl describe node | grep Taint # view the taints on all nodes
Taints: node-role.kubernetes.io/master:NoSchedule
Taints: <none>
Taints: <none>
# Remove taints
kubectl taint nodes node1 key1=value1:NoSchedule-
kubectl taint nodes --all key1- # remove the taint with key "key1" from all nodes
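As a concrete example, the master taint that caused the FailedScheduling events earlier in this article could be removed so that ordinary Pods may land on the control-plane node. This assumes the taint key shown in the describe output above; on newer clusters the key is node-role.kubernetes.io/control-plane instead:
# Allow regular workloads on master51 by deleting its default taint
kubectl taint nodes master51 node-role.kubernetes.io/master:NoSchedule-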
Tolerations: Taint Tolerations on Pods
- A Pod declares which taints it can accept through tolerations. For a Pod to be scheduled onto a node, it must tolerate all of that node's taints.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scheduler-taints-tolerations
spec:
  replicas: 5
  selector:
    matchLabels:
      apps: nginx
  template:
    metadata:
      labels:
        apps: nginx
    spec:
      # Configure the Pod's taint tolerations
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
        # Relationship between key and value; valid values: Exists, Equal (default)
        #   Exists: matches any value as long as the key exists, so value can be omitted.
        #   Equal:  the taint's value must equal the value specified here.
        operator: Exists
        #operator: Equal
      # Tolerate every taint on every node:
      #tolerations:
      #- operator: Exists
      containers:
      - image: nginx:1.23
        name: c1
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 500m
            memory: 1.5Gi
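A quick check after applying the manifest (output omitted): with the toleration in place, the master node's NoSchedule taint no longer blocks these Pods, so some replicas should now be able to land on master51 as well.
kubectl get pods -o wide # expect replicas on master51 in addition to the worker nodes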
nodeSelector: Node Selector
- nodeSelector selects the nodes a Pod may be scheduled to, based on node labels.
- Unlike nodeName, nodeSelector can match multiple nodes, and the Pod may be scheduled to any of them.
#First, label the nodes
[root@master51 ~]# kubectl label nodes master51 worker53 mengnan=shic
node/master51 labeled
node/worker53 labeled
#View the labels
[root@master51 ~]# kubectl get nodes --show-labels | grep mengnan
master51 Ready control-plane,master 5d4h v1.23.17 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=master51,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node.kubernetes.io/exclude-from-external-load-balancers=,mengnan=shic
worker53 Ready <none> 5d4h v1.23.17 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=worker53,kubernetes.io/os=linux,mengnan=shic
#Resource manifest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scheduler-nodeselector
spec:
  replicas: 5
  selector:
    matchLabels:
      apps: nginx
  template:
    metadata:
      labels:
        apps: nginx
    spec:
      # Configure the node selector
      nodeSelector:
        mengnan: shic
      # Tolerate every taint on every node
      tolerations:
      - operator: Exists
      containers:
      - image: nginx
        name: c1
        ports:
        - containerPort: 80
#After applying the manifest, the Pods are scheduled onto the two labeled nodes, master51 and worker53
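A quick way to confirm this with plain kubectl (no extra tooling assumed):
# List only the Pod names and the nodes they landed on
kubectl get pods -l apps=nginx -o custom-columns='NAME:.metadata.name,NODE:.spec.nodeName' # expect only master51 and worker53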
cordon/uncordon: Node Isolation
- cordon marks a node as unschedulable. In essence it taints the node: under the hood the node.kubernetes.io/unschedulable:NoSchedule taint is added.
- uncordon is the reverse operation: it makes the node schedulable again, and the taint is removed.
#Check node status and taints
[root@master51 ~]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
master51 Ready control-plane,master 5d4h v1.23.17
worker52 Ready <none> 5d4h v1.23.17
worker53 Ready <none> 5d4h v1.23.17
[root@master51 ~]#kubectl describe node | grep Taint -A 1
Taints: node-role.kubernetes.io/master:NoSchedule
Unschedulable: false
--
Taints: <none>
Unschedulable: false
--
Taints: <none>
Unschedulable: false
#Mark the node as unschedulable
[root@master51 ~]# kubectl cordon worker52
node/worker52 cordoned
#Check node status and taints again
[root@master51 ~]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
master51 Ready control-plane,master 5d4h v1.23.17
worker52 Ready,SchedulingDisabled <none> 5d4h v1.23.17
worker53 Ready <none> 5d4h v1.23.17
[root@master51 ~]#kubectl describe node | grep Taint -A 1
Taints: node-role.kubernetes.io/master:NoSchedule
Unschedulable: false
--
Taints: node.kubernetes.io/unschedulable:NoSchedule
Unschedulable: true
--
Taints: <none>
Unschedulable: false
#Make the node schedulable again
[root@master51 ~]# kubectl uncordon worker52
node/worker52 uncordoned
[root@master51 ~]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
master51 Ready control-plane,master 5d4h v1.23.17
worker52 Ready <none> 5d4h v1.23.17
worker53 Ready <none> 5d4h v1.23.17
[root@master51 ~]#kubectl describe node | grep Taint -A 1
Taints: node-role.kubernetes.io/master:NoSchedule
Unschedulable: false
--
Taints: <none>
Unschedulable: false
--
Taints: <none>
Unschedulable: false
drain: Draining a Node
- drain evicts the Pods already scheduled on a node; a typical use case is scaling down a Kubernetes cluster.
- Under the hood, drain calls cordon, which adds the node.kubernetes.io/unschedulable:NoSchedule taint to the node.
# Safely drain the node
[root@master51 ~]#kubectl drain worker53 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --force \
  --timeout=300s
# --ignore-daemonsets    : skip DaemonSet-managed Pods
# --delete-emptydir-data : allow deleting data in emptyDir volumes
# --force                : also evict Pods that are not managed by a controller
# --timeout=300s         : give up after this long
#drain calls cordon under the hood
[root@master51 ~]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
master51 Ready control-plane,master 5d5h v1.23.17
worker52 Ready <none> 5d4h v1.23.17
worker53 Ready,SchedulingDisabled <none> 5d4h v1.23.17
[root@master51 ~]#kubectl describe node | grep Taint -A 1
Taints: node-role.kubernetes.io/master:NoSchedule
Unschedulable: false
--
Taints: <none>
Unschedulable: false
--
Taints: node.kubernetes.io/unschedulable:NoSchedule
Unschedulable: true
# Restore the node
kubectl uncordon worker53
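To confirm that the drained node no longer runs any regular workloads (DaemonSet Pods may remain, since they were ignored), a plain field-selector query works:
# List everything still running on worker53 across all namespaces
kubectl get pods -A -o wide --field-selector spec.nodeName=worker53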
Affinity: Affinity Rules
- Affinity allows constraining Pod scheduling based on labels, so that Pods are placed according to your preferences.
- Affinity comes in three forms:
  - nodeAffinity
    - Makes Pods prefer certain nodes for scheduling. Functionally similar to nodeSelector, but more flexible.
  - podAffinity
    - Schedules based on topology domains: once the first Pod lands in a topology domain, subsequent matching Pods are scheduled into the same domain.
  - podAntiAffinity
    - The opposite of podAffinity: once the first Pod lands in a topology domain, subsequent matching Pods are kept out of that domain.
nodeAffinity example
#Label the nodes
[root@master51 ~]# kubectl label nodes master51 shic=mengnan
node/master51 labeled
[root@master51 ~]# kubectl label nodes worker53 shic=shuaibi
node/worker53 labeled
#nodeAffinity resource manifest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deploy-nodeaffinity
spec:
  replicas: 5
  selector:
    matchLabels:
      apps: nginx
  template:
    metadata:
      labels:
        apps: nginx
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: shic
                values:
                - mengnan
                - shuaibi
                # Relationship between key and values; valid operators: In, NotIn, Exists, DoesNotExist, Gt, Lt
                operator: In
      tolerations:
      - operator: Exists
      containers:
      - image: nginx:1.23
        name: c1
#Test and verify
#All Pods are scheduled onto nodes labeled shic=mengnan or shic=shuaibi
[root@master51 ~]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
deploy-nodeaffinity-686fbdc857-52b88 1/1 Running 0 3s 10.100.160.168 master51 <none> <none>
deploy-nodeaffinity-686fbdc857-5ltd5 1/1 Running 0 3s 10.100.160.169 master51 <none> <none>
deploy-nodeaffinity-686fbdc857-699dr 1/1 Running 0 3s 10.100.140.71 worker53 <none> <none>
deploy-nodeaffinity-686fbdc857-75tm2 1/1 Running 0 3s 10.100.140.69 worker53 <none> <none>
deploy-nodeaffinity-686fbdc857-jzw7k 1/1 Running 0 3s 10.100.140.70 worker53 <none> <none>
[root@master51 ~]#
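The example above uses the hard requirement form. nodeAffinity also supports a soft preference, preferredDuringSchedulingIgnoredDuringExecution, where the scheduler favors matching nodes but can still fall back to others. A minimal sketch of how the affinity block would change (the weight of 80 is an arbitrary choice, and the label value reuses shic=shuaibi from this example):
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 80 # 1-100; a higher weight means a stronger preference
            preference:
              matchExpressions:
              - key: shic
                operator: In
                values:
                - shuaibi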
podAffinity and podAntiAffinity examples
#Label the nodes
[root@master51 ~]# kubectl label nodes master51 dc=beijing
node/master51 labeled
[root@master51 ~]#
[root@master51 ~]# kubectl label nodes worker52 dc=shanghai
node/worker52 labeled
[root@master51 ~]#
[root@master51 ~]# kubectl label nodes worker53 dc=shenzhen
node/worker53 labeled
#podAffinity resource manifest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deploy-podaffinity
spec:
  replicas: 5
  selector:
    matchLabels:
      apps: nginx
  template:
    metadata:
      labels:
        apps: nginx
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          # Specify the topology domain
          - topologyKey: dc
            # Match the peer Pods by label
            labelSelector:
              matchLabels:
                apps: nginx
      tolerations:
      - operator: Exists
      containers:
      - image: nginx:1.23
        name: c1
#Test and verify
#Because the first Pod was scheduled to worker53, one of the three nodes carrying the dc label, all subsequent Pods followed it onto the same node
[root@master51 ~]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
deploy-podaffinity-9b4c679ff-6xs4q 1/1 Running 0 7s 10.100.140.76 worker53 <none> <none>
deploy-podaffinity-9b4c679ff-77cqt 1/1 Running 0 7s 10.100.140.75 worker53 <none> <none>
deploy-podaffinity-9b4c679ff-8lpdj 1/1 Running 0 7s 10.100.140.73 worker53 <none> <none>
deploy-podaffinity-9b4c679ff-jvskd 1/1 Running 0 7s 10.100.140.74 worker53 <none> <none>
deploy-podaffinity-9b4c679ff-lv5nd 1/1 Running 0 7s 10.100.140.72 worker53 <none> <none>
[root@master51 ~]#
#podAntiAffinity resource manifest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deploy-podantiaffinity
spec:
  replicas: 5
  selector:
    matchLabels:
      apps: nginx
  template:
    metadata:
      labels:
        apps: nginx
    spec:
      affinity:
        # Define the Pod anti-affinity
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: dc
            labelSelector:
              matchLabels:
                apps: nginx
      tolerations:
      - operator: Exists
      containers:
      - image: nginx:1.23
        name: c1
#Test and verify
#With anti-affinity configured, the first Pod lands on one node, the second on a different node, and the third on the remaining node. Since there are only three nodes, the last two Pods have nowhere to go and stay in the Pending state
[root@master51 ~]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
deploy-podantiaffinity-64597f4865-27r2c 1/1 Running 0 7s 10.100.160.170 master51 <none> <none>
deploy-podantiaffinity-64597f4865-5wsjc 1/1 Running 0 7s 10.100.140.82 worker53 <none> <none>
deploy-podantiaffinity-64597f4865-lp62s 0/1 Pending 0 7s <none> <none> <none> <none>
deploy-podantiaffinity-64597f4865-qlnvk 1/1 Running 0 7s 10.100.203.135 worker52 <none> <none>
deploy-podantiaffinity-64597f4865-z79cm 0/1 Pending 0 7s <none> <none> <none> <none>
[root@master51 ~]#
[root@master51 ~]# kubectl describe pod deploy-podantiaffinity-64597f4865-z79cm
Name: deploy-podantiaffinity-64597f4865-z79cm
Namespace: default
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 57s default-scheduler 0/3 nodes are available: 3 node(s) didn't match pod anti-affinity rules.
[root@master51 ~]#
Priority: Pod Priority
- Priority is an important mechanism influencing Kubernetes Pod scheduling: it allows certain Pods to take precedence when resources are tight.
  - Scheduling order: higher-priority Pods are placed first in the scheduling queue.
  - Resource allocation: when resources are scarce, higher-priority Pods get resources first.
  - Preemption: higher-priority Pods can preempt lower-priority Pods (depending on configuration).
  - System stability: used sensibly, priority keeps critical business Pods running.
- PriorityClass is a cluster-scoped resource that defines a priority class and contains:
  - value: a 32-bit integer priority value; user-defined classes must not exceed 1,000,000,000, since larger values are reserved for the built-in system classes (system-node-critical and system-cluster-critical).
  - globalDefault: whether this class is the cluster-wide default.
  - description: a human-readable description.
  - preemptionPolicy: the preemption policy.
- A Pod specifies which PriorityClass to use via the spec.priorityClassName field.
PriorityClass examples
#Resource manifests
# high-priority.yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "This priority class is for high priority pods"
preemptionPolicy: PreemptLowerPriority
---
# medium-priority.yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: medium-priority
value: 500000
globalDefault: true
description: "Default priority class for medium priority pods"
preemptionPolicy: Never
---
# low-priority.yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 100000
globalDefault: false
description: "Low priority pods"
preemptionPolicy: Never
#Test and verify
# Apply the PriorityClass definitions
kubectl apply -f high-priority.yaml
kubectl apply -f medium-priority.yaml
kubectl apply -f low-priority.yaml
# List all PriorityClasses
kubectl get priorityclass
#Output
NAME VALUE GLOBAL-DEFAULT AGE
high-priority 1000000 false 2m
medium-priority 500000 true 2m
low-priority 100000 false 2m
system-cluster-critical 2000000000 false 6d
system-node-critical 2000001000 false 6d
Using priorityClassName in a Pod
#Resource manifests
# high-priority-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: high-priority-app
  labels:
    app: high-priority
spec:
  priorityClassName: high-priority
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
      limits:
        memory: "512Mi"
        cpu: "500m"
---
# low-priority-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: low-priority-app
  labels:
    app: low-priority
spec:
  priorityClassName: low-priority
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        memory: 256Mi
        cpu: 250m
      limits:
        memory: 512Mi
        cpu: 500m
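After applying both Pods, the resolved priority of each one can be inspected with a plain custom-columns query (the PRIORITY column is filled in from the referenced PriorityClass at admission time):
# Show each Pod's priority class and its numeric priority
kubectl get pods -o custom-columns='NAME:.metadata.name,CLASS:.spec.priorityClassName,PRIORITY:.spec.priority'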