Hello everyone, I'm Nan Ge. Today we'll read through the Kubernetes Endpoints controller source code together. If this article helps you, please share it!
EndpointSubset
An EndpointSubset is a group of addresses that share a common set of ports. The expanded set of endpoints is the Cartesian product of Addresses (Pod IP addresses) and Ports (service port names and port numbers).
Here is a typical EndpointSubset example:
Name: "test",
Subsets: [
{
Addresses: [
{
"ip": "10.10.1.1"
},
{
"ip": "10.10.2.2"
}
],
Ports: [
{
"name": "a",
"port": 8675
},
{
"name": "b",
"port": 309
}
]
}
]
The Subset above expands to the following endpoint sets (a small Go sketch of this expansion follows below):
a: [ 10.10.1.1:8675, 10.10.2.2:8675 ]
b: [ 10.10.1.1:309, 10.10.2.2:309 ]
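As a minimal illustration of this Cartesian-product expansion (a standalone sketch, not code from the controller), the following program expands the example subset into per-port address lists using the k8s.io/api/core/v1 types:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// expandSubset shows how one EndpointSubset expands into the
// Cartesian product of its addresses and ports.
func expandSubset(subset v1.EndpointSubset) map[string][]string {
	result := make(map[string][]string)
	for _, port := range subset.Ports {
		for _, addr := range subset.Addresses {
			result[port.Name] = append(result[port.Name], fmt.Sprintf("%s:%d", addr.IP, port.Port))
		}
	}
	return result
}

func main() {
	subset := v1.EndpointSubset{
		Addresses: []v1.EndpointAddress{{IP: "10.10.1.1"}, {IP: "10.10.2.2"}},
		Ports:     []v1.EndpointPort{{Name: "a", Port: 8675}, {Name: "b", Port: 309}},
	}
	// Prints: map[a:[10.10.1.1:8675 10.10.2.2:8675] b:[10.10.1.1:309 10.10.2.2:309]]
	fmt.Println(expandSubset(subset))
}
```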
EndpointController
Let's start with the Endpoints controller struct, the core object that implements the Endpoints functionality.
// Controller manages selector-based service endpoints.
type Controller struct {
client clientset.Interface
eventBroadcaster record.EventBroadcaster
eventRecorder record.EventRecorder
// serviceLister is able to list/get services and is populated by the shared informer passed to
// NewEndpointController.
serviceLister corelisters.ServiceLister
// servicesSynced returns true if the service shared informer has been synced at least once.
// Added as a member to the struct to allow injection for testing.
servicesSynced cache.InformerSynced
// podLister is able to list/get pods and is populated by the shared informer passed to
// NewEndpointController.
podLister corelisters.PodLister
// podsSynced returns true if the pod shared informer has been synced at least once.
// Added as a member to the struct to allow injection for testing.
podsSynced cache.InformerSynced
// endpointsLister is able to list/get endpoints and is populated by the shared informer passed to
// NewEndpointController.
endpointsLister corelisters.EndpointsLister
// endpointsSynced returns true if the endpoints shared informer has been synced at least once.
// Added as a member to the struct to allow injection for testing.
endpointsSynced cache.InformerSynced
// Services that need to be updated. A channel is inappropriate here,
// because it allows services with lots of pods to be serviced much
// more often than services with few pods; it also would cause a
// service that's inserted multiple times to be processed more than
// necessary.
queue workqueue.TypedRateLimitingInterface[string]
// workerLoopPeriod is the time between worker runs. The workers process the queue of service and pod changes.
workerLoopPeriod time.Duration
// triggerTimeTracker is an util used to compute and export the EndpointsLastChangeTriggerTime
// annotation.
triggerTimeTracker *endpointsliceutil.TriggerTimeTracker
endpointUpdatesBatchPeriod time.Duration
}
Initialization
The NewEndpointController function initializes the Endpoints controller and returns the instance. The controller subscribes to change events for three resource types: Service, Pod, and Endpoints.
// NewEndpointController returns a new *Controller.
func NewEndpointController(ctx context.Context, podInformer coreinformers.PodInformer, serviceInformer coreinformers.ServiceInformer,
endpointsInformer coreinformers.EndpointsInformer, client clientset.Interface, endpointUpdatesBatchPeriod time.Duration) *Controller {
broadcaster := record.NewBroadcaster(record.WithContext(ctx))
recorder := broadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "endpoint-controller"})
e := &Controller{
client: client,
queue: workqueue.NewTypedRateLimitingQueueWithConfig(
workqueue.DefaultTypedControllerRateLimiter[string](),
workqueue.TypedRateLimitingQueueConfig[string]{
Name: "endpoint",
},
),
workerLoopPeriod: time.Second,
}
serviceInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: e.onServiceUpdate,
UpdateFunc: func(old, cur interface{}) {
e.onServiceUpdate(cur)
},
DeleteFunc: e.onServiceDelete,
})
e.serviceLister = serviceInformer.Lister()
e.servicesSynced = serviceInformer.Informer().HasSynced
podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: e.addPod,
UpdateFunc: e.updatePod,
DeleteFunc: e.deletePod,
})
e.podLister = podInformer.Lister()
e.podsSynced = podInformer.Informer().HasSynced
endpointsInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
DeleteFunc: e.onEndpointsDelete,
})
e.endpointsLister = endpointsInformer.Lister()
e.endpointsSynced = endpointsInformer.Informer().HasSynced
e.triggerTimeTracker = endpointsliceutil.NewTriggerTimeTracker()
e.eventBroadcaster = broadcaster
e.eventRecorder = recorder
e.endpointUpdatesBatchPeriod = endpointUpdatesBatchPeriod
return e
}
Starting the controller
Following the call chain of the initialization function NewEndpointController, we can find where the controller is started and run.
// cmd/kube-controller-manager/app/core.go
func startEndpointsController(ctx context.Context, controllerContext ControllerContext, controllerName string) (controller.Interface, bool, error) {
go endpointcontroller.NewEndpointController(
ctx,
controllerContext.InformerFactory.Core().V1().Pods(),
controllerContext.InformerFactory.Core().V1().Services(),
controllerContext.InformerFactory.Core().V1().Endpoints(),
controllerContext.ClientBuilder.ClientOrDie("endpoint-controller"),
controllerContext.ComponentConfig.EndpointController.EndpointUpdatesBatchPeriod.Duration,
).Run(ctx, int(controllerContext.ComponentConfig.EndpointController.ConcurrentEndpointSyncs))
return nil, true, nil
}
Core logic
The Controller.Run method performs the remaining startup work and launches the workers.
// Run will not return until stopCh is closed. workers determines how many
// endpoints will be handled in parallel.
func (e *Controller) Run(ctx context.Context, workers int) {
defer utilruntime.HandleCrash()
// Start events processing pipeline.
e.eventBroadcaster.StartStructuredLogging(3)
e.eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: e.client.CoreV1().Events("")})
defer e.eventBroadcaster.Shutdown()
defer e.queue.ShutDown()
logger := klog.FromContext(ctx)
logger.Info("Starting endpoint controller")
defer logger.Info("Shutting down endpoint controller")
if !cache.WaitForNamedCacheSync("endpoint", ctx.Done(), e.podsSynced, e.servicesSynced, e.endpointsSynced) {
return
}
for i := 0; i < workers; i++ {
go wait.UntilWithContext(ctx, e.worker, e.workerLoopPeriod)
}
go func() {
defer utilruntime.HandleCrash()
e.checkLeftoverEndpoints()
}()
<-ctx.Done()
}
The e.worker method is essentially an infinite polling loop: it keeps pulling service keys off the queue and processing them.
// worker runs a worker thread that just dequeues items, processes them, and
// marks them done. You may run as many of these in parallel as you wish; the
// workqueue guarantees that they will not end up processing the same service
// at the same time.
func (e *Controller) worker(ctx context.Context) {
for e.processNextWorkItem(ctx) {
}
}
func (e *Controller) processNextWorkItem(ctx context.Context) bool {
eKey, quit := e.queue.Get()
if quit {
return false
}
defer e.queue.Done(eKey)
logger := klog.FromContext(ctx)
err := e.syncService(ctx, eKey)
e.handleErr(logger, err, eKey)
return true
}
syncService
The controller's per-item handler is the syncService method, the core of the Endpoints controller. As the name suggests, the object the Endpoints controller primarily watches and reconciles is the Service.
func (e *Controller) syncService(ctx context.Context, key string) error {
startTime := time.Now()
logger := klog.FromContext(ctx)
// Parse the Service's namespace and name from the key
namespace, name, err := cache.SplitMetaNamespaceKey(key)
if err != nil {
return err
}
defer func() {
logger.V(4).Info("Finished syncing service endpoints", "service", klog.KRef(namespace, name), "startTime", time.Since(startTime))
}()
// Get the Service object
service, err := e.serviceLister.Services(namespace).Get(name)
if err != nil {
if !errors.IsNotFound(err) {
return err
}
// Delete the corresponding endpoint, as the service has been deleted.
// TODO: Please note that this will delete an endpoint when a
// service is deleted. However, if we're down at the time when
// the service is deleted, we will miss that deletion, so this
// doesn't completely solve the problem. See #6877.
err = e.client.CoreV1().Endpoints(namespace).Delete(ctx, name, metav1.DeleteOptions{})
if err != nil && !errors.IsNotFound(err) {
return err
}
e.triggerTimeTracker.DeleteService(namespace, name)
return nil
}
// The Service type is ExternalName: return directly
if service.Spec.Type == v1.ServiceTypeExternalName {
// services with Type ExternalName receive no endpoints from this controller;
// Ref: https://issues.k8s.io/105986
return nil
}
// The Service has no label selector, so no endpoints can be
// associated with it by this controller: return directly
if service.Spec.Selector == nil {
// services without a selector receive no endpoints from this controller;
// these services will receive the endpoints that are created out-of-band via the REST API.
return nil
}
logger.V(5).Info("About to update endpoints for service", "service", klog.KRef(namespace, name))
// List the Pods matched by the Service's label selector
pods, err := e.podLister.Pods(service.Namespace).List(labels.Set(service.Spec.Selector).AsSelectorPreValidated())
if err != nil {
// Since we're getting stuff from a local cache, it is
// basically impossible to get this error.
return err
}
// We call ComputeEndpointLastChangeTriggerTime here to make sure that the
// state of the trigger time tracker gets updated even if the sync turns out
// to be no-op and we don't update the endpoints object.
endpointsLastChangeTriggerTime := e.triggerTimeTracker.
ComputeEndpointLastChangeTriggerTime(namespace, service, pods)
// Endpoint subsets to be built
subsets := []v1.EndpointSubset{}
// Count of ready endpoints
var totalReadyEps int
// Count of not-ready endpoints
var totalNotReadyEps int
// Iterate over the Pod list
for _, pod := range pods {
// ShouldPodBeInEndpoints returns false (the pod is skipped) when:
// - the pod is in a terminal phase (phase == v1.PodFailed || phase == v1.PodSucceeded)
// - the pod has not been assigned an IP yet
// - the pod is being deleted and includeTerminating is false
if !endpointsliceutil.ShouldPodBeInEndpoints(pod, service.Spec.PublishNotReadyAddresses) {
logger.V(5).Info("Pod is not included on endpoints for Service", "pod", klog.KObj(pod), "service", klog.KObj(service))
continue
}
// Build an EndpointAddress for this Pod
ep, err := podToEndpointAddressForService(service, pod)
if err != nil {
// this will happen, if the cluster runs with some nodes configured as dual stack and some as not
// such as the case of an upgrade..
logger.V(2).Info("Failed to find endpoint for service with ClusterIP on pod with error", "service", klog.KObj(service), "clusterIP", service.Spec.ClusterIP, "pod", klog.KObj(pod), "error", err)
continue
}
epa := *ep
if endpointsliceutil.ShouldSetHostname(pod, service) {
epa.Hostname = pod.Spec.Hostname
}
// Allow headless service not to have ports.
if len(service.Spec.Ports) == 0 {
if service.Spec.ClusterIP == api.ClusterIPNone {
// Add a new entry to the subsets; ports is nil here (headless Service without ports)
subsets, totalReadyEps, totalNotReadyEps = addEndpointSubset(logger, subsets, pod, epa, nil, service.Spec.PublishNotReadyAddresses)
// No need to repack subsets for headless service without ports.
}
} else {
for i := range service.Spec.Ports {
servicePort := &service.Spec.Ports[i]
portNum, err := podutil.FindPort(pod, servicePort)
if err != nil {
logger.V(4).Info("Failed to find port for service", "service", klog.KObj(service), "error", err)
continue
}
// Build an EndpointPort from the Service port and the resolved container port number
epp := endpointPortFromServicePort(servicePort, portNum)
var readyEps, notReadyEps int
// Append the address/port pair to the endpoint subsets
subsets, readyEps, notReadyEps = addEndpointSubset(logger, subsets, pod, epa, epp, service.Spec.PublishNotReadyAddresses)
// Accumulate the ready endpoint count
totalReadyEps = totalReadyEps + readyEps
// Accumulate the not-ready endpoint count
totalNotReadyEps = totalNotReadyEps + notReadyEps
}
}
}
// Repack to get the final endpoint subsets (the new set)
subsets = endpoints.RepackSubsets(subsets)
// Fetch the Service's current Endpoints object (the old set) from the informer cache
// See if there's actually an update here.
currentEndpoints, err := e.endpointsLister.Endpoints(service.Namespace).Get(service.Name)
if err != nil {
if !errors.IsNotFound(err) {
return err
}
currentEndpoints = &v1.Endpoints{
ObjectMeta: metav1.ObjectMeta{
Name: service.Name,
Labels: service.Labels,
},
}
}
// If the current Endpoints object has no resource version, it does not exist yet and must be created
createEndpoints := len(currentEndpoints.ResourceVersion) == 0
// Compare the sorted subsets and labels
// Remove the HeadlessService label from the endpoints if it exists,
// as this won't be set on the service itself
// and will cause a false negative in this diff check.
// But first check if it has that label to avoid expensive copies.
compareLabels := currentEndpoints.Labels
if _, ok := currentEndpoints.Labels[v1.IsHeadlessService]; ok {
compareLabels = utillabels.CloneAndRemoveLabel(currentEndpoints.Labels, v1.IsHeadlessService)
}
// When comparing the subsets, we ignore the difference in ResourceVersion of Pod to avoid unnecessary Endpoints
// updates caused by Pod updates that we don't care, e.g. annotation update.
// Sort and compare the new and old subsets (plus labels and the capacity annotation).
// If nothing differs and the Endpoints object does not need to be created,
// there is nothing to do: return directly
if !createEndpoints &&
endpointSubsetsEqualIgnoreResourceVersion(currentEndpoints.Subsets, subsets) &&
apiequality.Semantic.DeepEqual(compareLabels, service.Labels) &&
capacityAnnotationSetCorrectly(currentEndpoints.Annotations, currentEndpoints.Subsets) {
logger.V(5).Info("endpoints are equal, skipping update", "service", klog.KObj(service))
return nil
}
// Deep-copy the current Endpoints object and
// set the new subsets, labels and annotations on it
newEndpoints := currentEndpoints.DeepCopy()
newEndpoints.Subsets = subsets
newEndpoints.Labels = service.Labels
if newEndpoints.Annotations == nil {
newEndpoints.Annotations = make(map[string]string)
}
if !endpointsLastChangeTriggerTime.IsZero() {
newEndpoints.Annotations[v1.EndpointsLastChangeTriggerTime] =
endpointsLastChangeTriggerTime.UTC().Format(time.RFC3339Nano)
} else { // No new trigger time, clear the annotation.
delete(newEndpoints.Annotations, v1.EndpointsLastChangeTriggerTime)
}
if truncateEndpoints(newEndpoints) {
newEndpoints.Annotations[v1.EndpointsOverCapacity] = truncated
} else {
delete(newEndpoints.Annotations, v1.EndpointsOverCapacity)
}
if newEndpoints.Labels == nil {
newEndpoints.Labels = make(map[string]string)
}
if !helper.IsServiceIPSet(service) {
newEndpoints.Labels = utillabels.CloneAndAddLabel(newEndpoints.Labels, v1.IsHeadlessService, "")
} else {
newEndpoints.Labels = utillabels.CloneAndRemoveLabel(newEndpoints.Labels, v1.IsHeadlessService)
}
logger.V(4).Info("Update endpoints", "service", klog.KObj(service), "readyEndpoints", totalReadyEps, "notreadyEndpoints", totalNotReadyEps)
if createEndpoints {
// No previous endpoints, create them
// Create a new Endpoints object
_, err = e.client.CoreV1().Endpoints(service.Namespace).Create(ctx, newEndpoints, metav1.CreateOptions{})
} else {
// Pre-existing
// Update the existing Endpoints object
_, err = e.client.CoreV1().Endpoints(service.Namespace).Update(ctx, newEndpoints, metav1.UpdateOptions{})
}
if err != nil {
if createEndpoints && errors.IsForbidden(err) {
// A request is forbidden primarily for two reasons:
// 1. namespace is terminating, endpoint creation is not allowed by default.
// 2. policy is misconfigured, in which case no service would function anywhere.
// Given the frequency of 1, we log at a lower level.
logger.V(5).Info("Forbidden from creating endpoints", "error", err)
// If the namespace is terminating, creates will continue to fail. Simply drop the item.
if errors.HasStatusCause(err, v1.NamespaceTerminatingCause) {
return nil
}
}
if createEndpoints {
e.eventRecorder.Eventf(newEndpoints, v1.EventTypeWarning, "FailedToCreateEndpoint", "Failed to create endpoint for service %v/%v: %v", service.Namespace, service.Name, err)
} else {
e.eventRecorder.Eventf(newEndpoints, v1.EventTypeWarning, "FailedToUpdateEndpoint", "Failed to update endpoint %v/%v: %v", service.Namespace, service.Name, err)
}
return err
}
return nil
}
From the source of `Controller.syncService`, we can see that every sync of an Endpoints object performs the following steps (a small sketch for inspecting the result follows the list):
1. Get the Service object specified by the key parameter
2. List the Pods matched by the Service's label selector
3. Compute the new (desired) endpoint subsets from the Service and the Pod list
4. Fetch the current (old) Endpoints object for the Service from the informer cache
5. If the new set does not differ from the old one, no update is needed and the method simply returns
6. Based on the Endpoints object's resource version, decide whether to create or update the Endpoints object, and perform that call
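To see the result of this reconciliation from the outside, here is a minimal client-go sketch (not part of the controller; the kubeconfig path, namespace, and service name are placeholders) that fetches the Endpoints object the controller maintains for a Service and prints its expanded endpoints:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path; adjust for your environment.
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// "default"/"test" are placeholder namespace and service names.
	ep, err := client.CoreV1().Endpoints("default").Get(context.TODO(), "test", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	// Print the Cartesian product maintained by the Endpoints controller.
	for _, subset := range ep.Subsets {
		for _, port := range subset.Ports {
			for _, addr := range subset.Addresses {
				fmt.Printf("%s -> %s:%d\n", port.Name, addr.IP, port.Port)
			}
		}
	}
}
```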
---
description: endpointSlice
---
# 4.11 endpointSlice controller
What is an EndpointSlice, and how does it differ from the Endpoints resource we already know?
The official enhancement proposal is here:
[https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/0752-endpointslices](https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/0752-endpointslices)
**With the Endpoints API, a Service has exactly one Endpoints resource.** That single object has to store the IP address and port (network endpoint) of every Pod backing the Service, which makes for very large API objects. Compounding the problem, kube-proxy runs on every node and watches Endpoints resources for any update: if even a single network endpoint in an Endpoints resource changes, the whole object has to be sent to every instance of kube-proxy.
Another limitation of the Endpoints API is the number of network endpoints that can be tracked for a Service. **Objects stored in etcd have a default size limit of 1.5 MB, which in some cases caps an Endpoints resource at about 5,000 Pod IPs.** For most users, whose Services stay well under 5,000 Pods, this is not an issue, but it becomes a significant problem for Services approaching this size.
To show how serious these problems can become, a simple example helps. Consider a Service backed by 5,000 Pods whose Endpoints resource ends up around 1.5 MB. If a single network endpoint in that list changes, the full Endpoints resource has to be distributed to every node in the cluster. In a large cluster with 3,000 nodes this becomes a real problem: each update means sending about 4.5 GB of data across the cluster (1.5 MB × 3,000 nodes), and it happens on every endpoint change. Now imagine a rolling update that replaces all 5,000 Pods: more than 22 TB would be transferred (roughly the capacity of 5,000 DVDs).
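The 22 TB figure follows directly from the numbers above; as a back-of-the-envelope check (assuming each of the 5,000 Pod replacements triggers a full 1.5 MB Endpoints update sent to all 3,000 nodes):

$$
1.5\,\text{MB} \times 3000\ \text{nodes} \approx 4.5\,\text{GB per endpoint change}
$$

$$
5000\ \text{changes} \times 4.5\,\text{GB} \approx 22.5\,\text{TB per rolling update}
$$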
<figure><img src="https://raw.gitcode.com/mouuii/k8s-learning/raw/main/.gitbook/assets/截屏2024-07-17 17.23.08.png" alt=""><figcaption></figcaption></figure>
## Splitting endpoints with the EndpointSlice API
The EndpointSlice API was designed to address this problem with an approach similar to sharding. Instead of tracking all Pod IPs for a Service in a single Endpoints resource, they are split across multiple smaller EndpointSlices.
Consider an example where a Service is backed by 15 Pods. With the Endpoints API we would end up with a single Endpoints resource tracking all of them. If EndpointSlices are configured to store 5 endpoints each, we end up with 3 different EndpointSlices:
<figure><img src="https://raw.gitcode.com/mouuii/k8s-learning/raw/main/.gitbook/assets/截屏2024-07-17 17.23.56.png" alt=""><figcaption></figcaption></figure>
By default, each EndpointSlice stores up to 100 endpoints, although this can be configured with the --max-endpoints-per-slice flag on kube-controller-manager.
### Entry function
The entry function is located in cmd/kube-controller-manager/app/discovery.go:
```go
func startEndpointSliceController(ctx context.Context, controllerContext ControllerContext, controllerName string) (controller.Interface, bool, error) {
go endpointslicecontroller.NewController(
ctx,
controllerContext.InformerFactory.Core().V1().Pods(),
controllerContext.InformerFactory.Core().V1().Services(),
controllerContext.InformerFactory.Core().V1().Nodes(),
controllerContext.InformerFactory.Discovery().V1().EndpointSlices(),
controllerContext.ComponentConfig.EndpointSliceController.MaxEndpointsPerSlice,
controllerContext.ClientBuilder.ClientOrDie("endpointslice-controller"),
controllerContext.ComponentConfig.EndpointSliceController.EndpointUpdatesBatchPeriod.Duration,
).Run(ctx, int(controllerContext.ComponentConfig.EndpointSliceController.ConcurrentServiceEndpointSyncs))
return nil, true, nil
}
```

### Constructor
- maxEndpointsPerSlice: the maximum number of endpoints per slice.
- triggerTimeTracker: computes the last update time of the Service and of its Pods, caches them, and returns the later of the two.
- reconciler: where the core logic of the controller lives.
- features.TopologyAwareHints: whether the topology-aware hints feature (routing to nearby endpoints) is enabled. For example, if nodes A and B are in one zone and C and D in another, with one backing Pod on each of the four nodes, the ipvs rules on A and B will only list the Pod IPs on A and B as backends for the Service, and likewise for C and D. A very readable reference (in Chinese): Kubernetes Service 开启拓扑感知(就近访问)能力. A short sketch of what such hints look like on an EndpointSlice follows this list.
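As a rough illustration of what topology-aware hints look like on the API object (a standalone sketch, not taken from the controller; the names and zone are made up), the following builds an EndpointSlice whose single endpoint carries a zone hint:

```go
package main

import (
	"fmt"

	discoveryv1 "k8s.io/api/discovery/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	zone := "zone-a" // hypothetical zone name
	slice := discoveryv1.EndpointSlice{
		ObjectMeta: metav1.ObjectMeta{
			Name: "my-svc-abc123", // hypothetical slice name
			Labels: map[string]string{
				discoveryv1.LabelServiceName: "my-svc", // owning Service
			},
		},
		AddressType: discoveryv1.AddressTypeIPv4,
		Endpoints: []discoveryv1.Endpoint{{
			Addresses: []string{"10.10.1.1"},
			Zone:      &zone,
			// Hints tell consumers (e.g. kube-proxy) which zones should
			// route to this endpoint when topology-aware routing is on.
			Hints: &discoveryv1.EndpointHints{
				ForZones: []discoveryv1.ForZone{{Name: zone}},
			},
		}},
	}
	fmt.Printf("%s endpoints=%d\n", slice.Name, len(slice.Endpoints))
}
```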
// NewController creates and initializes a new Controller
func NewController(ctx context.Context, podInformer coreinformers.PodInformer,
serviceInformer coreinformers.ServiceInformer,
nodeInformer coreinformers.NodeInformer,
endpointSliceInformer discoveryinformers.EndpointSliceInformer,
maxEndpointsPerSlice int32,
client clientset.Interface,
endpointUpdatesBatchPeriod time.Duration,
) *Controller {
broadcaster := record.NewBroadcaster(record.WithContext(ctx))
recorder := broadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "endpoint-slice-controller"})
endpointslicemetrics.RegisterMetrics()
c := &Controller{
client: client,
// This is similar to the DefaultControllerRateLimiter, just with a
// significantly higher default backoff (1s vs 5ms). This controller
// processes events that can require significant EndpointSlice changes,
// such as an update to a Service or Deployment. A more significant
// rate limit back off here helps ensure that the Controller does not
// overwhelm the API Server.
queue: workqueue.NewTypedRateLimitingQueueWithConfig(
workqueue.NewTypedMaxOfRateLimiter(
workqueue.NewTypedItemExponentialFailureRateLimiter[string](defaultSyncBackOff, maxSyncBackOff),
// 10 qps, 100 bucket size. This is only for retry speed and its
// only the overall factor (not per item).
&workqueue.TypedBucketRateLimiter[string]{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
),
workqueue.TypedRateLimitingQueueConfig[string]{
Name: "endpoint_slice",
},
),
workerLoopPeriod: time.Second,
}
serviceInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: c.onServiceUpdate,
UpdateFunc: func(old, cur interface{}) {
c.onServiceUpdate(cur)
},
DeleteFunc: c.onServiceDelete,
})
c.serviceLister = serviceInformer.Lister()
c.servicesSynced = serviceInformer.Informer().HasSynced
podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: c.addPod,
UpdateFunc: c.updatePod,
DeleteFunc: c.deletePod,
})
c.podLister = podInformer.Lister()
c.podsSynced = podInformer.Informer().HasSynced
c.nodeLister = nodeInformer.Lister()
c.nodesSynced = nodeInformer.Informer().HasSynced
logger := klog.FromContext(ctx)
endpointSliceInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: c.onEndpointSliceAdd,
UpdateFunc: func(oldObj, newObj interface{}) {
c.onEndpointSliceUpdate(logger, oldObj, newObj)
},
DeleteFunc: c.onEndpointSliceDelete,
})
c.endpointSliceLister = endpointSliceInformer.Lister()
c.endpointSlicesSynced = endpointSliceInformer.Informer().HasSynced
c.endpointSliceTracker = endpointsliceutil.NewEndpointSliceTracker()
c.maxEndpointsPerSlice = maxEndpointsPerSlice
c.triggerTimeTracker = endpointsliceutil.NewTriggerTimeTracker()
c.eventBroadcaster = broadcaster
c.eventRecorder = recorder
c.endpointUpdatesBatchPeriod = endpointUpdatesBatchPeriod
if utilfeature.DefaultFeatureGate.Enabled(features.TopologyAwareHints) {
nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: func(obj interface{}) {
c.addNode(logger, obj)
},
UpdateFunc: func(oldObj, newObj interface{}) {
c.updateNode(logger, oldObj, newObj)
},
DeleteFunc: func(obj interface{}) {
c.deleteNode(logger, obj)
},
})
c.topologyCache = topologycache.NewTopologyCache()
}
c.reconciler = endpointslicerec.NewReconciler(
c.client,
c.nodeLister,
c.maxEndpointsPerSlice,
c.endpointSliceTracker,
c.topologyCache,
c.eventRecorder,
controllerName,
endpointslicerec.WithTrafficDistributionEnabled(utilfeature.DefaultFeatureGate.Enabled(features.ServiceTrafficDistribution)),
)
return c
}
### Watches
The controller watches Service, Pod, Node, and EndpointSlice objects.
service objects
- AddFunc: onServiceUpdate caches the Service's selector and adds the Service key to the rate-limited (token-bucket) queue.
- UpdateFunc: onServiceUpdate, same as above.
- DeleteFunc: onServiceDelete removes the cached selector and adds the Service key to the rate-limited queue.
pod objects
- AddFunc: addPod looks up the Services that match the Pod and adds each of them to the delaying queue.
- UpdateFunc: updatePod, same as above.
- DeleteFunc: deletePod; if the Pod object is not nil, it is handled with the same logic as addPod.
node objects
Node event handlers are only registered when the TopologyAwareHints feature is enabled.
- addNode: calls c.checkNodeTopologyDistribution() to check the topology distribution of the nodes.
- updateNode: checks the node's status, then calls c.checkNodeTopologyDistribution().
- deleteNode: calls c.checkNodeTopologyDistribution().
endpointSlice objects
- AddFunc: onEndpointSliceAdd calls c.queueServiceForEndpointSlice(), which derives the Service's unique key, computes an update delay, and adds the key to the delaying queue after that delay (see the sketch after this list).
- UpdateFunc: onEndpointSliceUpdate, which ultimately also calls c.queueServiceForEndpointSlice().
- DeleteFunc: onEndpointSliceDelete checks whether the deletion was expected; if the slice should not have been deleted, it calls c.queueServiceForEndpointSlice() so the Service is re-synced.
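The "compute a delay, then enqueue" pattern used by these handlers boils down to the workqueue's AddAfter call. A minimal sketch with a made-up key and batch period (the controller derives the real delay from its endpointUpdatesBatchPeriod and recent queueing history):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	queue := workqueue.NewTypedDelayingQueue[string]()
	defer queue.ShutDown()

	// Hypothetical batch period; in the controller the delay depends on how
	// recently the same service key was queued.
	batchPeriod := 300 * time.Millisecond
	queue.AddAfter("default/my-svc", batchPeriod)

	key, _ := queue.Get() // blocks until the delay has elapsed and the item is available
	fmt.Println("processing", key)
	queue.Done(key)
}
```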
### syncService
The core logic starts in syncService, and the work ultimately ends up in the r.finalize() function.
// serviceQueueWorker runs a worker thread that just dequeues items, processes
// them, and marks them done. You may run as many of these in parallel as you
// wish; the workqueue guarantees that they will not end up processing the same
// service at the same time
func (c *Controller) serviceQueueWorker(logger klog.Logger) {
for c.processNextServiceWorkItem(logger) {
}
}
func (c *Controller) processNextServiceWorkItem(logger klog.Logger) bool {
cKey, quit := c.serviceQueue.Get()
if quit {
return false
}
defer c.serviceQueue.Done(cKey)
err := c.syncService(logger, cKey)
c.handleErr(logger, err, cKey)
return true
}
syncService does the following:
- Gets the Service object.
- Lists the Pods matching the Service's label selector (these Pods are what the desired slices, slicesToCreate, are computed from).
- Lists all EndpointSlices already in the apiserver that are associated with the Service, by namespace and labels.
- Filters out EndpointSlices that are marked for deletion.
- Finally calls c.reconciler.Reconcile().
func (c *Controller) syncService(logger klog.Logger, key string) error {
startTime := time.Now()
defer func() {
logger.V(4).Info("Finished syncing service endpoint slices", "key", key, "elapsedTime", time.Since(startTime))
}()
namespace, name, err := cache.SplitMetaNamespaceKey(key)
if err != nil {
return err
}
service, err := c.serviceLister.Services(namespace).Get(name)
if err != nil {
if !apierrors.IsNotFound(err) {
return err
}
c.triggerTimeTracker.DeleteService(namespace, name)
c.reconciler.DeleteService(namespace, name)
c.endpointSliceTracker.DeleteService(namespace, name)
// The service has been deleted, return nil so that it won't be retried.
return nil
}
if service.Spec.Type == v1.ServiceTypeExternalName {
// services with Type ExternalName receive no endpoints from this controller;
// Ref: https://issues.k8s.io/105986
return nil
}
if service.Spec.Selector == nil {
// services without a selector receive no endpoint slices from this controller;
// these services will receive endpoint slices that are created out-of-band via the REST API.
return nil
}
logger.V(5).Info("About to update endpoint slices for service", "key", key)
podLabelSelector := labels.Set(service.Spec.Selector).AsSelectorPreValidated()
pods, err := c.podLister.Pods(service.Namespace).List(podLabelSelector)
if err != nil {
// Since we're getting stuff from a local cache, it is basically
// impossible to get this error.
c.eventRecorder.Eventf(service, v1.EventTypeWarning, "FailedToListPods",
"Error listing Pods for Service %s/%s: %v", service.Namespace, service.Name, err)
return err
}
esLabelSelector := labels.Set(map[string]string{
discovery.LabelServiceName: service.Name,
discovery.LabelManagedBy: c.reconciler.GetControllerName(),
}).AsSelectorPreValidated()
endpointSlices, err := c.endpointSliceLister.EndpointSlices(service.Namespace).List(esLabelSelector)
if err != nil {
// Since we're getting stuff from a local cache, it is basically
// impossible to get this error.
c.eventRecorder.Eventf(service, v1.EventTypeWarning, "FailedToListEndpointSlices",
"Error listing Endpoint Slices for Service %s/%s: %v", service.Namespace, service.Name, err)
return err
}
// Drop EndpointSlices that have been marked for deletion to prevent the controller from getting stuck.
endpointSlices = dropEndpointSlicesPendingDeletion(endpointSlices)
if c.endpointSliceTracker.StaleSlices(service, endpointSlices) {
return endpointslicepkg.NewStaleInformerCache("EndpointSlice informer cache is out of date")
}
// We call ComputeEndpointLastChangeTriggerTime here to make sure that the
// state of the trigger time tracker gets updated even if the sync turns out
// to be no-op and we don't update the EndpointSlice objects.
lastChangeTriggerTime := c.triggerTimeTracker.
ComputeEndpointLastChangeTriggerTime(namespace, service, pods)
err = c.reconciler.Reconcile(logger, service, pods, endpointSlices, lastChangeTriggerTime)
if err != nil {
c.eventRecorder.Eventf(service, v1.EventTypeWarning, "FailedToUpdateEndpointSlices",
"Error updating Endpoint Slices for Service %s/%s: %v", service.Namespace, service.Name, err)
return err
}
return nil
}
### reconcile
c.reconciler.Reconcile() declares a slice slicesToDelete and a map slicesByAddressType, then:
- Checks each EndpointSlice's AddressType: slices with an address type the Service no longer supports are added to slicesToDelete for removal, and the supported ones are grouped into slicesByAddressType.
- For each supported address type, r.reconcileByAddressType() is called to reconcile the slices of that type; the address type is passed in as a parameter.
// Reconcile takes a set of pods currently matching a service selector and
// compares them with the endpoints already present in any existing endpoint
// slices for the given service. It creates, updates, or deletes endpoint slices
// to ensure the desired set of pods are represented by endpoint slices.
func (r *Reconciler) Reconcile(logger klog.Logger, service *corev1.Service, pods []*corev1.Pod, existingSlices []*discovery.EndpointSlice, triggerTime time.Time) error {
slicesToDelete := []*discovery.EndpointSlice{} // slices that are no longer matching any address the service has
errs := []error{} // all errors generated in the process of reconciling
slicesByAddressType := make(map[discovery.AddressType][]*discovery.EndpointSlice) // slices by address type
// addresses that this service supports [o(1) find]
serviceSupportedAddressesTypes := getAddressTypesForService(logger, service)
// loop through slices identifying their address type.
// slices that no longer match address type supported by services
// go to delete, other slices goes to the Reconciler machinery
// for further adjustment
for _, existingSlice := range existingSlices {
// service no longer supports that address type, add it to deleted slices
if !serviceSupportedAddressesTypes.Has(existingSlice.AddressType) {
if r.topologyCache != nil {
svcKey, err := ServiceControllerKey(existingSlice)
if err != nil {
logger.Info("Couldn't get key to remove EndpointSlice from topology cache", "existingSlice", existingSlice, "err", err)
} else {
r.topologyCache.RemoveHints(svcKey, existingSlice.AddressType)
}
}
slicesToDelete = append(slicesToDelete, existingSlice)
continue
}
// add list if it is not on our map
if _, ok := slicesByAddressType[existingSlice.AddressType]; !ok {
slicesByAddressType[existingSlice.AddressType] = make([]*discovery.EndpointSlice, 0, 1)
}
slicesByAddressType[existingSlice.AddressType] = append(slicesByAddressType[existingSlice.AddressType], existingSlice)
}
// reconcile for existing.
for addressType := range serviceSupportedAddressesTypes {
existingSlices := slicesByAddressType[addressType]
err := r.reconcileByAddressType(logger, service, pods, existingSlices, triggerTime, addressType)
if err != nil {
errs = append(errs, err)
}
}
// delete those which are of addressType that is no longer supported
// by the service
for _, sliceToDelete := range slicesToDelete {
err := r.client.DiscoveryV1().EndpointSlices(service.Namespace).Delete(context.TODO(), sliceToDelete.Name, metav1.DeleteOptions{})
if err != nil {
errs = append(errs, fmt.Errorf("error deleting %s EndpointSlice for Service %s/%s: %w", sliceToDelete.Name, service.Namespace, service.Name, err))
} else {
r.endpointSliceTracker.ExpectDeletion(sliceToDelete)
metrics.EndpointSliceChanges.WithLabelValues("delete").Inc()
}
}
return utilerrors.NewAggregate(errs)
}
r.reconcileByAddressType()
- Declares the slices slicesToCreate, slicesToUpdate, and slicesToDelete.
- Builds a map describing the existing state of the EndpointSlices, existingSlicesByPortMap.
- Builds a map describing the desired state, desiredEndpointsByPortMap.
- Determines whether each group of EndpointSlices needs to change by calling r.reconcileByPortMapping(), which returns slicesToCreate, slicesToUpdate, slicesToDelete, numAdded, and numRemoved for r.finalize() to act on (during this computation each slice is filled up to the configured endpoint limit, 100 by default, and the leftover endpoints that don't fill a slice go into one final, partially filled slice; see the packing sketch after this list).
- Calls r.finalize() to create, update, or delete the resulting EndpointSlice objects.
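The packing behaviour described above (fill each slice up to maxEndpointsPerSlice, with the remainder going into one final partially filled slice) can be shown with a much-simplified sketch; this is not the reconciler's actual code, which also reuses and updates existing slices:

```go
package main

import (
	"fmt"

	discoveryv1 "k8s.io/api/discovery/v1"
)

// packEndpoints splits the desired endpoints into groups of at most
// maxEndpointsPerSlice entries each (simplified: ignores existing slices).
func packEndpoints(endpoints []discoveryv1.Endpoint, maxEndpointsPerSlice int) [][]discoveryv1.Endpoint {
	var slices [][]discoveryv1.Endpoint
	for len(endpoints) > 0 {
		n := maxEndpointsPerSlice
		if len(endpoints) < n {
			n = len(endpoints)
		}
		slices = append(slices, endpoints[:n])
		endpoints = endpoints[n:]
	}
	return slices
}

func main() {
	// 15 fake endpoints packed 5 per slice -> 3 slices, mirroring the example above.
	eps := make([]discoveryv1.Endpoint, 15)
	for i := range eps {
		eps[i] = discoveryv1.Endpoint{Addresses: []string{fmt.Sprintf("10.0.0.%d", i+1)}}
	}
	for i, s := range packEndpoints(eps, 5) {
		fmt.Printf("slice %d: %d endpoints\n", i+1, len(s))
	}
}
```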
r.finalize()
- When there are slices to delete and slices to create at the same time, finalize first recycles names: a to-be-created slice takes over the name of a to-be-deleted slice, turning a create+delete pair into a single update (presumably to reduce API-call overhead). For example, if A, B, and C need to be created and D and E deleted, A takes D's name and B takes E's name, and those two become updates; see the sketch after this list.
- It then performs the creates, updates, and deletes of the slices in turn.
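A much-simplified sketch of that name-recycling step (illustrative only; the real finalize also matches address types and carries over more metadata):

```go
package main

import (
	"fmt"

	discoveryv1 "k8s.io/api/discovery/v1"
)

// recycleNames turns matching create+delete pairs into updates by giving a
// to-be-created slice the name of a to-be-deleted one (simplified).
func recycleNames(toCreate, toDelete []*discoveryv1.EndpointSlice) (creates, updates, deletes []*discoveryv1.EndpointSlice) {
	for len(toDelete) > 0 && len(toCreate) > 0 {
		toCreate[0].Name = toDelete[0].Name
		updates = append(updates, toCreate[0])
		toCreate, toDelete = toCreate[1:], toDelete[1:]
	}
	return toCreate, updates, toDelete
}

func main() {
	mk := func(name string) *discoveryv1.EndpointSlice {
		s := &discoveryv1.EndpointSlice{}
		s.Name = name
		return s
	}
	c, u, d := recycleNames(
		[]*discoveryv1.EndpointSlice{mk("A"), mk("B"), mk("C")},
		[]*discoveryv1.EndpointSlice{mk("D"), mk("E")},
	)
	// Prints: 1 creates, 2 updates, 0 deletes
	fmt.Println(len(c), "creates,", len(u), "updates,", len(d), "deletes")
}
```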
### Summary
- Overall, the logic is much like the other controllers: watch events on the relevant resources, then reconcile.
- As the code above shows, one characteristic of EndpointSlices is that, by default, a new slice is started once a slice is full at 100 entries, keeping every slice's capacity within 100 entries.
- The controller we just read can create, update, and delete slices, but the source tree also contains an endpointslicemirroring controller.
- endpointslicemirroring: in some cases, applications create custom Endpoints resources. To ensure those applications do not have to update Endpoints and EndpointSlice resources in parallel, the cluster's control plane mirrors most Endpoints resources to corresponding EndpointSlices. The control plane does not mirror an Endpoints resource when:
  - the Endpoints resource has the label endpointslice.kubernetes.io/skip-mirror set to true;
  - the Endpoints resource carries the control-plane.alpha.kubernetes.io/leader annotation;
  - the corresponding Service resource does not exist;
  - the corresponding Service has a non-empty selector.
- We'll look at the endpointslicemirroring controller another time; next we'll move on to other components.