
A Brief Walk Through the Kubernetes Source Code: the Kubelet

I studied the source code of the main Kubernetes modules for work a while ago, and I am sharing my notes here so that everyone can learn from and discuss them.

Kubernetes runs the kubelet on every node to manage the Pods on that node. Starting from the creation of a single Pod, this article works through the source code to explain the basic principles of this part of the kubelet.

(All code below is from v1.18.)

I. While managing the Pod lifecycle, the kubelet watches for different kinds of events and performs the corresponding create/update/delete/query handling. This is implemented mainly by two methods, syncLoop and syncLoopIteration.

syncLoop is the main loop that processes these changes. Inside a never-exiting for loop it calls syncLoopIteration to keep watching a set of events that come mainly from three channels (file, apiserver and http url); syncLoop merges them for further processing. When a new change is observed, it calls the corresponding handler so that the Pod moves from its current state towards the desired state (sync); even when the configuration has not changed, it still synchronizes on the period set by sync-frequency to keep Pods and containers in the desired state.

At the start of the function, two tickers are created, syncTicker and housekeepingTicker, which define the periodic sync interval mentioned above and the housekeeping (cleanup) interval. Next, plegCh is defined, a channel that carries updates from the PLEG (PodLifecycleEventGenerator, which keeps a cache of Pod lifecycle events such as container start, termination and failure). While calling syncLoopIteration in the loop, the kubelet periodically health-checks its modules (such as the PLEG) and records the results in runtimeState. If runtimeErrors() returns an error during this check, the loop sleeps with exponential backoff (the first wait is 100ms per the base constant below, each subsequent wait doubles, and the maximum wait is 5 seconds) and keeps retrying until the error clears.

Note: you may occasionally hit failures where the kubelet log shows errors like "skipping pod synchronization - PLEG is not healthy". In that state syncLoopIteration is never called, which means the kubelet can no longer manage Pod lifecycles. To dig deeper, look at the Healthy function of the module reported in the error to find the likely root cause.
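
As a rough illustration of what such a Healthy function usually does, here is a simplified, self-contained sketch (not the real PLEG code): the module records when it last made progress and reports unhealthy once too much time has passed. The 3-minute threshold is only an assumption modeled on the upstream relist threshold.

// A simplified sketch of the health-check pattern used by modules such as
// PLEG: record the time of the last successful iteration and report
// unhealthy if too much time has elapsed. Not the real kubelet code.
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

type healthChecker struct {
	lastRun   atomic.Value // stores time.Time
	threshold time.Duration
}

func (h *healthChecker) markRun() { h.lastRun.Store(time.Now()) }

func (h *healthChecker) Healthy() error {
	last, ok := h.lastRun.Load().(time.Time)
	if !ok {
		return nil // never ran yet; treat as healthy
	}
	if elapsed := time.Since(last); elapsed > h.threshold {
		return fmt.Errorf("module is not healthy: last run was %v ago, threshold is %v", elapsed, h.threshold)
	}
	return nil
}

func main() {
	h := &healthChecker{threshold: 3 * time.Minute} // threshold is illustrative
	h.markRun()
	fmt.Println(h.Healthy()) // <nil> as long as runs keep happening on time
}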

Everything above can be seen as preparation done inside the main loop; next comes the call to syncLoopIteration(updates, handler, syncTicker.C, housekeepingTicker.C, plegCh).

/pkg/kubelet/kubelet.go

// syncLoop is the main loop for processing changes. It watches for changes from
 
// three channels (file, apiserver, and http) and creates a union of them. For
 
// any new change seen, will run a sync against desired state and running state. If
 
// no changes are seen to the configuration, will synchronize the last known desired
 
// state every sync-frequency seconds. Never returns.
 
func (kl *Kubelet) syncLoop(updates <-chan kubetypes.PodUpdate, handler SyncHandler) {
 
    klog.Info("Starting kubelet main sync loop.")
 
    // The syncTicker wakes up kubelet to checks if there are any pod workers
 
    // that need to be sync'd. A one-second period is sufficient because the
 
    // sync interval is defaulted to 10s.
 
    syncTicker := time.NewTicker(time.Second)
 
    defer syncTicker.Stop()
 
    housekeepingTicker := time.NewTicker(housekeepingPeriod)
 
    defer housekeepingTicker.Stop()
 
    plegCh := kl.pleg.Watch()
 
    const (
 
        base   = 100 * time.Millisecond
 
        max    = 5 * time.Second
 
        factor = 2
 
    )
 
    duration := base
 
    // Responsible for checking limits in resolv.conf
 
    // The limits do not have anything to do with individual pods
 
    // Since this is called in syncLoop, we don't need to call it anywhere else
 
    if kl.dnsConfigurer != nil && kl.dnsConfigurer.ResolverConfig != "" {
 
        kl.dnsConfigurer.CheckLimitsForResolvConf()
 
    }
 
    for {
 
        if err := kl.runtimeState.runtimeErrors(); err != nil {
 
            klog.Errorf("skipping pod synchronization - %v", err)
 
            // exponential backoff
 
            time.Sleep(duration)
 
            duration = time.Duration(math.Min(float64(max), factor*float64(duration)))
 
            continue
 
        }
 
        // reset backoff if we have a success
 
        duration = base
 
        kl.syncLoopMonitor.Store(kl.clock.Now())
 
        if !kl.syncLoopIteration(updates, handler, syncTicker.C, housekeepingTicker.C, plegCh) {
 
            break
 
        }
 
        kl.syncLoopMonitor.Store(kl.clock.Now())
 
    }
 
}

syncLoopIteration iterates over the channels mentioned above and, whenever a message appears on one of them, hands it to the corresponding handler. Its main parameters are:

1. configCh: the channel that delivers Pod update (configuration change) events, i.e. the merged stream of events watched from file, apiserver and http url;

2. handler: an interface of type SyncHandler, defined below; it can be read as the set of methods for adding, updating and removing Pods, reconciling their state (Reconcile) and cleaning up;

type SyncHandler interface {
        HandlePodAdditions(pods []*v1.Pod)
        HandlePodUpdates(pods []*v1.Pod)
        HandlePodRemoves(pods []*v1.Pod)
        HandlePodReconcile(pods []*v1.Pod)
        HandlePodSyncs(pods []*v1.Pod)
        HandlePodCleanups() error
}

3. syncCh: the channel used to periodically sync Pods to their latest state;

4. housekeepingCh: the channel for housekeeping events, responsible for Pod cleanup;

5. plegCh: the channel that carries PLEG update events.

In addition, Pod liveness status updates are read from the kubelet liveness manager's update channel.

Looking at the overall structure, the function is built around Go's select-case construct, which differs from the familiar switch-case in two important ways (a short standalone example follows this list):

1. The cases of a select only apply to channel operations, whether sends or receives; when several cases are ready, one is chosen in pseudorandom order, and if no channel has an event the select blocks;

2. switch-case supports many more kinds of expressions, and its cases are evaluated in order from top to bottom.
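
The following tiny, self-contained program (not kubelet code) demonstrates that behavior: with two ready channels, either case may fire on any iteration, and a timeout case keeps the select from blocking forever.

// A minimal demonstration of select semantics: pseudorandom choice between
// ready channels, blocking avoided here by a timeout case.
package main

import (
	"fmt"
	"time"
)

func main() {
	a := make(chan string, 1)
	b := make(chan string, 1)
	a <- "from a"
	b <- "from b"

	for i := 0; i < 4; i++ {
		select {
		case msg := <-a:
			fmt.Println(msg)
			a <- "from a" // refill so the channel stays ready
		case msg := <-b:
			fmt.Println(msg)
			b <- "from b"
		case <-time.After(100 * time.Millisecond):
			fmt.Println("timeout") // would fire only if neither channel were ready
		}
	}
}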

The main branches are:

1. configCh: the most important branch. It receives updates from the configuration sources (an "update" here means a configuration change, not just a Pod UPDATE operation; adds, updates and deletes of a Pod's configuration are all updates) and dispatches them to the matching handler by operation type. For example, adding a Pod calls handler.HandlePodAdditions(u.Pods); later, the callback func (kl *Kubelet) HandlePodAdditions(pods []*v1.Pod) implements that SyncHandler method.

2. plegCh: for PLEG messages other than container-removal events (removing a container does not affect the Pod's lifecycle), handler.HandlePodSyncs is called to perform the sync that the PLEG event requires.

3. syncCh and housekeepingCh: both are channels of time values (syncCh <-chan time.Time, housekeepingCh <-chan time.Time). Driven by the syncTicker and housekeepingTicker defined in syncLoop, they periodically trigger handler.HandlePodSyncs (sync Pods) and handler.HandlePodCleanups (clean up Pods) respectively.

4. livenessManager.Updates(): if the liveness manager reports a container as Failure, i.e. unavailable, handler.HandlePodSyncs is called immediately to update the Pod's status so that the next action can be decided.

/pkg/kubelet/kubelet.go

// syncLoopIteration reads from various channels and dispatches pods to the
 
// given handler.
 
//
 
// Arguments:
 
// 1.  configCh:       a channel to read config events from
 
// 2.  handler:        the SyncHandler to dispatch pods to
 
// 3.  syncCh:         a channel to read periodic sync events from
 
// 4.  housekeepingCh: a channel to read housekeeping events from
 
// 5.  plegCh:         a channel to read PLEG updates from
 
//
 
// Events are also read from the kubelet liveness manager's update channel.
 
//
 
// The workflow is to read from one of the channels, handle that event, and
 
// update the timestamp in the sync loop monitor.
 
//
 
// Here is an appropriate place to note that despite the syntactical
 
// similarity to the switch statement, the case statements in a select are
 
// evaluated in a pseudorandom order if there are multiple channels ready to
 
// read from when the select is evaluated.  In other words, case statements
 
// are evaluated in random order, and you can not assume that the case
 
// statements evaluate in order if multiple channels have events.
 
//
 
// With that in mind, in truly no particular order, the different channels
 
// are handled as follows:
 
//
 
// * configCh: dispatch the pods for the config change to the appropriate
 
//             handler callback for the event type
 
// * plegCh: update the runtime cache; sync pod
 
// * syncCh: sync all pods waiting for sync
 
// * housekeepingCh: trigger cleanup of pods
 
// * liveness manager: sync pods that have failed or in which one or more
 
//                     containers have failed liveness checks
 
func (kl *Kubelet) syncLoopIteration(configCh <-chan kubetypes.PodUpdate, handler SyncHandler,
 
    syncCh <-chan time.Time, housekeepingCh <-chan time.Time, plegCh <-chan *pleg.PodLifecycleEvent) bool {
 
    select {
 
    case u, open := <-configCh:
 
        // Update from a config source; dispatch it to the right handler
 
        // callback.
 
        if !open {
 
            klog.Errorf("Update channel is closed. Exiting the sync loop.")
 
            return false
 
        }
 
 
 
 
        switch u.Op {
 
        case kubetypes.ADD:
 
            klog.V(2).Infof("SyncLoop (ADD, %q): %q", u.Source, format.Pods(u.Pods))
 
            // After restarting, kubelet will get all existing pods through
 
            // ADD as if they are new pods. These pods will then go through the
 
            // admission process and *may* be rejected. This can be resolved
 
            // once we have checkpointing.
 
            handler.HandlePodAdditions(u.Pods)
 
        case kubetypes.UPDATE:
 
            klog.V(2).Infof("SyncLoop (UPDATE, %q): %q", u.Source, format.PodsWithDeletionTimestamps(u.Pods))
 
            handler.HandlePodUpdates(u.Pods)
 
        case kubetypes.REMOVE:
 
            klog.V(2).Infof("SyncLoop (REMOVE, %q): %q", u.Source, format.Pods(u.Pods))
 
            handler.HandlePodRemoves(u.Pods)
 
        case kubetypes.RECONCILE:
 
            klog.V(4).Infof("SyncLoop (RECONCILE, %q): %q", u.Source, format.Pods(u.Pods))
 
            handler.HandlePodReconcile(u.Pods)
 
        case kubetypes.DELETE:
 
            klog.V(2).Infof("SyncLoop (DELETE, %q): %q", u.Source, format.Pods(u.Pods))
 
            // DELETE is treated as a UPDATE because of graceful deletion.
 
            handler.HandlePodUpdates(u.Pods)
 
        case kubetypes.RESTORE:
 
            klog.V(2).Infof("SyncLoop (RESTORE, %q): %q", u.Source, format.Pods(u.Pods))
 
            // These are pods restored from the checkpoint. Treat them as new
 
            // pods.
 
            handler.HandlePodAdditions(u.Pods)
 
        case kubetypes.SET:
 
            // TODO: Do we want to support this?
 
            klog.Errorf("Kubelet does not support snapshot update")
 
        }
 
 
 
 
        if u.Op != kubetypes.RESTORE {
 
            // If the update type is RESTORE, it means that the update is from
 
            // the pod checkpoints and may be incomplete. Do not mark the
 
            // source as ready.
 
 
 
 
            // Mark the source ready after receiving at least one update from the
 
            // source. Once all the sources are marked ready, various cleanup
 
            // routines will start reclaiming resources. It is important that this
 
            // takes place only after kubelet calls the update handler to process
 
            // the update to ensure the internal pod cache is up-to-date.
 
            kl.sourcesReady.AddSource(u.Source)
 
        }
 
    case e := <-plegCh:
 
        if isSyncPodWorthy(e) {
 
            // PLEG event for a pod; sync it.
 
            if pod, ok := kl.podManager.GetPodByUID(e.ID); ok {
 
                klog.V(2).Infof("SyncLoop (PLEG): %q, event: %#v", format.Pod(pod), e)
 
                handler.HandlePodSyncs([]*v1.Pod{pod})
 
            } else {
 
                // If the pod no longer exists, ignore the event.
 
                klog.V(4).Infof("SyncLoop (PLEG): ignore irrelevant event: %#v", e)
 
            }
 
        }
 
 
 
 
        if e.Type == pleg.ContainerDied {
 
            if containerID, ok := e.Data.(string); ok {
 
                kl.cleanUpContainersInPod(e.ID, containerID)
 
            }
 
        }
 
    case <-syncCh:
 
        // Sync pods waiting for sync
 
        podsToSync := kl.getPodsToSync()
 
        if len(podsToSync) == 0 {
 
            break
 
        }
 
        klog.V(4).Infof("SyncLoop (SYNC): %d pods; %s", len(podsToSync), format.Pods(podsToSync))
 
        handler.HandlePodSyncs(podsToSync)
 
    case update := <-kl.livenessManager.Updates():
 
        if update.Result == proberesults.Failure {
 
            // The liveness manager detected a failure; sync the pod.
 
 
 
 
            // We should not use the pod from livenessManager, because it is never updated after
 
            // initialization.
 
            pod, ok := kl.podManager.GetPodByUID(update.PodUID)
 
            if !ok {
 
                // If the pod no longer exists, ignore the update.
 
                klog.V(4).Infof("SyncLoop (container unhealthy): ignore irrelevant update: %#v", update)
 
                break
 
            }
 
            klog.V(1).Infof("SyncLoop (container unhealthy): %q", format.Pod(pod))
 
            handler.HandlePodSyncs([]*v1.Pod{pod})
 
        }
 
    case <-housekeepingCh:
 
        if !kl.sourcesReady.AllReady() {
 
            // If the sources aren't ready or volume manager has not yet synced the states,
 
            // skip housekeeping, as we may accidentally delete pods from unready sources.
 
            klog.V(4).Infof("SyncLoop (housekeeping, skipped): sources aren't ready yet.")
 
        } else {
 
            klog.V(4).Infof("SyncLoop (housekeeping)")
 
            if err := handler.HandlePodCleanups(); err != nil {
 
                klog.Errorf("Failed cleaning pods: %v", err)
 
            }
 
        }
 
    }
 
    return true
 
}

II. Next we step into the implementation of HandlePodAdditions to see how a Pod is actually added:

First, sort.Sort(sliceutils.PodsByCreationTime(pods)) sorts the submitted slice of Pods by creation time, so that the earliest-created Pods are processed first.

Then, because the podManager is the source of truth for the desired state (it records the desired state of every Pod, and a Pod that no longer exists in the podManager has effectively been deleted), GetPods and AddPod are called to fetch the existing Pods and add the new one. AddPod() registers the Pod's secrets and configMaps (if any) with pm.secretManager and pm.configMapManager, and stores the Pod's metadata such as its UID and FullName.

If the Pod is a Mirror Pod, the separate handler handleMirrorPod is used. Mirror Pods only make sense together with Static Pods: a Static Pod is a Pod defined on the local node through a configuration file or an HTTP source and watched directly by the kubelet. Unlike ordinary Pods, it cannot be managed through the API server, cannot be moved to another node and cannot be associated with any replication controller; I personally think of it as an "unmanaged" Pod. For visibility, the kubelet creates a read-only copy of such a Pod in the API server, and that copy is called the Mirror Pod.
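
As a rough, self-contained illustration (not the kubelet source itself), a mirror pod can be recognized by an annotation on the Pod; the annotation key below mirrors the one used upstream and is stated here as an assumption.

// A simplified sketch of how a mirror pod is recognized: mirror pods created
// for static pods carry a dedicated config annotation.
package main

import "fmt"

const configMirrorAnnotationKey = "kubernetes.io/config.mirror" // assumed key

type pod struct {
	Name        string
	Annotations map[string]string
}

func isMirrorPod(p *pod) bool {
	_, ok := p.Annotations[configMirrorAnnotationKey]
	return ok
}

func main() {
	p := &pod{
		Name:        "etcd-node1",
		Annotations: map[string]string{configMirrorAnnotationKey: "checksum"},
	}
	fmt.Println(isMirrorPod(p)) // true
}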

Next comes the admission check, canAdmitPod(activePods, pod), which verifies whether the Pod can be started on the node this kubelet runs on; if the requirements are not met, the Pod is rejected immediately with a specific reason.

If the Pod passes admission, it may run on the current node, and dispatchWork(pod, kubetypes.SyncPodCreate, mirrorPod, start) hands it to a worker for asynchronous processing.

Finally, probeManager.AddPod(pod) registers the Pod with the probe manager. If the Pod defines StartupProbe (new in v1.18, a startup probe for slow-starting applications that runs before LivenessProbe and ReadinessProbe so that a slow start is not mistaken for a failure and killed), LivenessProbe or ReadinessProbe health checks, goroutines are started to run them periodically.
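
The idea of "one goroutine per probe, driven by a ticker" can be sketched in isolation as follows. This is a toy illustration of the pattern, not the kubelet's prober implementation; runProbeWorker is a made-up helper.

// A toy sketch of the idea behind probeManager.AddPod: each probe gets its
// own goroutine that runs the check on a fixed period until stopped.
package main

import (
	"fmt"
	"time"
)

func runProbeWorker(name string, period time.Duration, probe func() bool, stop <-chan struct{}) {
	ticker := time.NewTicker(period)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			if probe() {
				fmt.Printf("%s probe succeeded\n", name)
			} else {
				fmt.Printf("%s probe failed\n", name)
			}
		}
	}
}

func main() {
	stop := make(chan struct{})
	go runProbeWorker("liveness", 200*time.Millisecond, func() bool { return true }, stop)
	time.Sleep(time.Second)
	close(stop)
}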

/pkg/kubelet/kubelet.go

// HandlePodAdditions is the callback in SyncHandler for pods being added from
 
// a config source.
 
func (kl *Kubelet) HandlePodAdditions(pods []*v1.Pod) {
 
    start := kl.clock.Now()
 
    sort.Sort(sliceutils.PodsByCreationTime(pods))
 
    for _, pod := range pods {
 
        existingPods := kl.podManager.GetPods()
 
        // Always add the pod to the pod manager. Kubelet relies on the pod
 
        // manager as the source of truth for the desired state. If a pod does
 
        // not exist in the pod manager, it means that it has been deleted in
 
        // the apiserver and no action (other than cleanup) is required.
 
        kl.podManager.AddPod(pod)
 
        if kubetypes.IsMirrorPod(pod) {
 
            kl.handleMirrorPod(pod, start)
 
            continue
 
        }
 
        if !kl.podIsTerminated(pod) {
 
            // Only go through the admission process if the pod is not
 
            // terminated.
 
 
            // We failed pods that we rejected, so activePods include all admitted
 
            // pods that are alive.
 
            activePods := kl.filterOutTerminatedPods(existingPods)
 
            // Check if we can admit the pod; if not, reject it.
 
            if ok, reason, message := kl.canAdmitPod(activePods, pod); !ok {
 
                kl.rejectPod(pod, reason, message)
 
                continue
 
            }
 
        }
 
        mirrorPod, _ := kl.podManager.GetMirrorPodByPod(pod)
 
        kl.dispatchWork(pod, kubetypes.SyncPodCreate, mirrorPod, start)
 
        kl.probeManager.AddPod(pod)
 
    }
 
}

The following sections focus on three parts: the admission check, dispatching to a worker for asynchronous execution, and adding the Pod to the probeManager.

III. Admission check

The admission check decides whether the Pod can run on the node where the kubelet lives. It is implemented by canAdmitPod(pods []*v1.Pod, pod *v1.Pod); the first parameter is the set of Pods already admitted by the podManager and not yet terminated, the second is the Pod about to be added, and the return values are a boolean saying whether the Pod is admitted plus a brief reason and message when it is not.

The function then loops over the admission rules registered in admitHandlers; as soon as one rule rejects the Pod it returns immediately, meaning the new Pod may not be created on this node. admitHandlers is a variable of type lifecycle.PodAdmitHandlers, which is defined as []PodAdmitHandler, i.e. a slice whose elements are PodAdmitHandler. PodAdmitHandler is the interface shown below, so any type with an Admit method that takes PodAdmitAttributes and returns PodAdmitResult implements it. Quite a few handlers implement this interface, in other words quite a few admission rules are checked, including evictionAdmitHandler, sysctlsWhitelist, GetAllocateResourcesPodAdmitHandler and NewPredicateAdmitHandler; for reasons of space they are not described in detail here.

type PodAdmitHandler interface {
	// Admit evaluates if a pod can be admitted.
	Admit(attrs *PodAdmitAttributes) PodAdmitResult
}
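
To make the interface concrete, here is a short, self-contained sketch of a custom admit handler. The types are simplified stand-ins for the real ones in pkg/kubelet/lifecycle (which operate on *v1.Pod values), and maxContainersAdmitHandler is purely hypothetical.

// A self-contained illustration of the PodAdmitHandler pattern: a handler
// that rejects pods declaring more containers than a configured limit.
package main

import "fmt"

type PodAdmitAttributes struct {
	PodName        string
	ContainerCount int
}

type PodAdmitResult struct {
	Admit   bool
	Reason  string
	Message string
}

type PodAdmitHandler interface {
	Admit(attrs *PodAdmitAttributes) PodAdmitResult
}

type maxContainersAdmitHandler struct{ max int }

func (h maxContainersAdmitHandler) Admit(attrs *PodAdmitAttributes) PodAdmitResult {
	if attrs.ContainerCount > h.max {
		return PodAdmitResult{
			Admit:   false,
			Reason:  "TooManyContainers",
			Message: fmt.Sprintf("pod %s declares %d containers, limit is %d", attrs.PodName, attrs.ContainerCount, h.max),
		}
	}
	return PodAdmitResult{Admit: true}
}

func main() {
	handlers := []PodAdmitHandler{maxContainersAdmitHandler{max: 10}}
	attrs := &PodAdmitAttributes{PodName: "demo", ContainerCount: 12}
	// Same shape as canAdmitPod: first rejection wins.
	for _, h := range handlers {
		if result := h.Admit(attrs); !result.Admit {
			fmt.Println("rejected:", result.Reason, result.Message)
			return
		}
	}
	fmt.Println("admitted")
}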

/pkg/kubelet/kubelet.go

// canAdmitPod determines if a pod can be admitted, and gives a reason if it
 
// cannot. "pod" is new pod, while "pods" are all admitted pods
 
// The function returns a boolean value indicating whether the pod
 
// can be admitted, a brief single-word reason and a message explaining why
 
// the pod cannot be admitted.
 
func (kl *Kubelet) canAdmitPod(pods []*v1.Pod, pod *v1.Pod) (bool, string, string) {
 
    // the kubelet will invoke each pod admit handler in sequence
 
    // if any handler rejects, the pod is rejected.
 
    // TODO: move out of disk check into a pod admitter
 
    // TODO: out of resource eviction should have a pod admitter call-out
 
    attrs := &lifecycle.PodAdmitAttributes{Pod: pod, OtherPods: pods}
 
    for _, podAdmitHandler := range kl.admitHandlers {
 
        if result := podAdmitHandler.Admit(attrs); !result.Admit {
 
            return false, result.Reason, result.Message
 
        }
 
    }
 
    return true, "", ""
 
}

IV. Dispatching to a worker for asynchronous execution:

Once the admission conditions are satisfied, we reach the core of the actual execution.

dispatchWork dispatches the task for a Pod to a concrete pod worker for asynchronous execution. The method is fairly simple: it first checks whether the Pod of this task has already terminated, and if so only does some cleanup and status updates. In every other case it calls podWorkers.UpdatePod, repackaging the arguments that dispatchWork received into a pointer to an UpdatePodOptions struct, so the interesting part is the implementation of UpdatePod.

/pkg/kubelet/kubelet.go

// dispatchWork starts the asynchronous sync of the pod in a pod worker.
 
// If the pod has completed termination, dispatchWork will perform no action.
 
func (kl *Kubelet) dispatchWork(pod *v1.Pod, syncType kubetypes.SyncPodType, mirrorPod *v1.Pod, start time.Time) {
 
    // check whether we are ready to delete the pod from the API server (all status up to date)
 
    containersTerminal, podWorkerTerminal := kl.podAndContainersAreTerminal(pod)
 
    if pod.DeletionTimestamp != nil && containersTerminal {
 
        klog.V(4).Infof("Pod %q has completed execution and should be deleted from the API server: %s", format.Pod(pod), syncType)
 
        kl.statusManager.TerminatePod(pod)
 
        return
 
    }
 
    // optimization: avoid invoking the pod worker if no further changes are possible to the pod definition
 
    if podWorkerTerminal {
 
        klog.V(4).Infof("Pod %q has completed, ignoring remaining sync work: %s", format.Pod(pod), syncType)
 
        return
 
    }
 
    // Run the sync in an async worker.
 
    kl.podWorkers.UpdatePod(&UpdatePodOptions{
 
        Pod:        pod,
 
        MirrorPod:  mirrorPod,
 
        UpdateType: syncType,
 
        OnCompleteFunc: func(err error) {
 
            if err != nil {
 
                metrics.PodWorkerDuration.WithLabelValues(syncType.String()).Observe(metrics.SinceInSeconds(start))
 
            }
 
        },
 
    })
 
    // Note the number of containers for new pods.
 
    if syncType == kubetypes.SyncPodCreate {
 
        metrics.ContainersPerPodCount.Observe(float64(len(pod.Spec.Containers)))
 
    }
 
}

UpdatePod is implemented in pod_workers.go. It first takes a lock around the Pod update. podUpdates is a map field on the pod workers (podUpdates map[types.UID]chan UpdatePodOptions) whose keys are Pod UIDs and whose values are channels of UpdatePodOptions.

It then checks whether a channel for this Pod's UID already exists in the podUpdates map (podUpdates, exists = p.podUpdates[uid] is Go's "comma ok" idiom for map lookups; exists is simply a boolean variable reporting whether the key is present, not a keyword). Since we are looking at the add-Pod scenario, exists is false. For every newly added Pod, or after a kubelet restart, the kubelet starts a new goroutine that acts as a new pod worker responsible for all later work; as the code shows, that goroutine runs managePodLoop(podUpdates).
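
The pattern of "one buffered channel plus one goroutine per UID" can be shown in isolation; the sketch below is a simplified illustration, not the real podWorkers type.

// A simplified sketch of the per-pod worker pattern used by
// podWorkers.UpdatePod: one buffered channel and one goroutine per pod UID.
package main

import (
	"fmt"
	"sync"
	"time"
)

type update struct {
	UID     string
	Message string
}

type workers struct {
	mu      sync.Mutex
	updates map[string]chan update
}

func (w *workers) UpdatePod(u update) {
	w.mu.Lock()
	defer w.mu.Unlock()
	ch, exists := w.updates[u.UID]
	if !exists {
		// First update for this UID: create its channel and start a worker.
		ch = make(chan update, 1)
		w.updates[u.UID] = ch
		go func() {
			for msg := range ch {
				fmt.Printf("syncing pod %s: %s\n", msg.UID, msg.Message)
			}
		}()
	}
	ch <- u
}

func main() {
	w := &workers{updates: make(map[string]chan update)}
	w.UpdatePod(update{UID: "uid-1", Message: "create"})
	w.UpdatePod(update{UID: "uid-2", Message: "create"})
	time.Sleep(100 * time.Millisecond) // give the workers time to run
}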

/pkg/kubelet/pod_workers.go

// Apply the new setting to the specified pod.
 
// If the options provide an OnCompleteFunc, the function is invoked if the update is accepted.
 
// Update requests are ignored if a kill pod request is pending.
 
func (p *podWorkers) UpdatePod(options *UpdatePodOptions) {
 
    pod := options.Pod
 
    uid := pod.UID
 
    var podUpdates chan UpdatePodOptions
 
    var exists bool
 
 
    p.podLock.Lock()
 
    defer p.podLock.Unlock()
 
    if podUpdates, exists = p.podUpdates[uid]; !exists {
 
        // We need to have a buffer here, because checkForUpdates() method that
 
        // puts an update into channel is called from the same goroutine where
 
        // the channel is consumed. However, it is guaranteed that in such case
 
        // the channel is empty, so buffer of size 1 is enough.
 
        podUpdates = make(chan UpdatePodOptions, 1)
 
        p.podUpdates[uid] = podUpdates
 
 
        // Creating a new pod worker either means this is a new pod, or that the
 
        // kubelet just restarted. In either case the kubelet is willing to believe
 
        // the status of the pod for the first pod worker sync. See corresponding
 
        // comment in syncPod.
 
        go func() {
 
            defer runtime.HandleCrash()
 
            p.managePodLoop(podUpdates)
 
        }()
 
    }
 
    if !p.isWorking[pod.UID] {
 
        p.isWorking[pod.UID] = true
 
        podUpdates <- *options
 
    } else {
 
        // if a request to kill a pod is pending, we do not let anything overwrite that request.
 
        update, found := p.lastUndeliveredWorkUpdate[pod.UID]
 
        if !found || update.UpdateType != kubetypes.SyncPodKill {
 
            p.lastUndeliveredWorkUpdate[pod.UID] = *options
 
        }
 
    }
 
}

The core of managePodLoop is the call to syncPodFn: Pod lifecycle management ultimately comes down to this sync-Pod function. The kubelet does have a caching layer here, and a sync is triggered only once the cache holds data newer than the previous sync time (podCache.GetNewerThan). syncPodFn is a field of type syncPodFnType (syncPodFn syncPodFnType), and syncPodFnType is itself a function type (type syncPodFnType func(options syncPodOptions) error). Crucially, when newPodWorkers is constructed, the kubelet initializes syncPodFn to klet.syncPod, so the analysis finally leads us to the syncPod function.
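
The Go mechanism behind that wiring, assigning a method value to a function-typed field, can be shown with a toy example; doSync below is a made-up stand-in for Kubelet.syncPod.

// A toy example of the mechanism described above: a function-typed field on
// the worker is assigned a method value, and the worker later calls it
// without knowing who implements it.
package main

import "fmt"

type syncOptions struct{ podName string }

// syncFnType mirrors the shape of syncPodFnType: options in, error out.
type syncFnType func(o syncOptions) error

type worker struct {
	syncFn syncFnType
}

type kubeletLike struct{}

func (k *kubeletLike) doSync(o syncOptions) error {
	fmt.Println("syncing", o.podName)
	return nil
}

func main() {
	k := &kubeletLike{}
	w := &worker{syncFn: k.doSync} // method value bound to k
	_ = w.syncFn(syncOptions{podName: "nginx"})
}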

/pkg/kubelet/pod_workers.go

func (p *podWorkers) managePodLoop(podUpdates <-chan UpdatePodOptions) {
 
    var lastSyncTime time.Time
 
    for update := range podUpdates {
 
        err := func() error {
 
            podUID := update.Pod.UID
 
            // This is a blocking call that would return only if the cache
 
            // has an entry for the pod that is newer than minRuntimeCache
 
            // Time. This ensures the worker doesn't start syncing until
 
            // after the cache is at least newer than the finished time of
 
            // the previous sync.
 
            status, err := p.podCache.GetNewerThan(podUID, lastSyncTime)
 
            if err != nil {
 
                // This is the legacy event thrown by manage pod loop
 
                // all other events are now dispatched from syncPodFn
 
                p.recorder.Eventf(update.Pod, v1.EventTypeWarning, events.FailedSync, "error determining status: %v", err)
 
                return err
 
            }
 
            err = p.syncPodFn(syncPodOptions{
 
                mirrorPod:      update.MirrorPod,
 
                pod:            update.Pod,
 
                podStatus:      status,
 
                killPodOptions: update.KillPodOptions,
 
                updateType:     update.UpdateType,
 
            })
 
            lastSyncTime = time.Now()
 
            return err
 
        }()
 
        // notify the call-back function if the operation succeeded or not
 
        if update.OnCompleteFunc != nil {
 
            update.OnCompleteFunc(err)
 
        }
 
        if err != nil {
 
            // IMPORTANT: we do not log errors here, the syncPodFn is responsible for logging errors
 
            klog.Errorf("Error syncing pod %s (%q), skipping: %v", update.Pod.UID, format.Pod(update.Pod), err)
 
        }
 
        p.wrapUp(update.Pod.UID, err)
 
    }
 
}

V. The syncPod stage:

The syncPod function is very long, and the large comment block inside it describes the whole workflow:

1. If the update is a kill (SyncPodKill), update the Pod's status, kill the Pod immediately and return;

2. Check whether the Pod can run on this node. This is another admission check, but different from the one described earlier: inside the kubelet it can be called a soft admission check (by contrast, the earlier one can be called a hard admission check). A soft admission check runs after the Pod has already been accepted by the kubelet on this node but before it actually runs; if a soft admission check rejects the Pod, it stays Pending on that node indefinitely, so you can think of it as a check that allows assignment but forbids running. Soft admission checks cover runtime-related capabilities such as host network mode, privileged mode and proc mounts, and include NewAppArmorAdmitHandler, NewNoNewPrivsAdmitHandler and NewProcMountAdmitHandler. In syncPod, if the soft admission check fails, the Pod is killed directly and an error is returned.

3. If the cgroups-per-qos flag is enabled (it is by default), QoS-level and Pod-level cgroups are created to enforce resource limits.

4. If it is a static pod, create or update the corresponding mirror pod.

5. Create the Pod's data directories: the pod directory, the volumes directory and the plugins directory.

6. Wait for the volume manager, which works in the background, to finish attaching and mounting all of the Pod's volumes.

7. If the Pod references secrets, fetch the Pod's image pull secrets from the API server.

8. Finally, and most importantly, call the container runtime's SyncPod method (containerRuntime.SyncPod), which implements the real container creation logic.

/pkg/kubelet/kubelet.go

// syncPod is the transaction script for the sync of a single pod.
 
//
 
// Arguments:
 
//
 
// o - the SyncPodOptions for this invocation
 
//
 
// The workflow is:
 
// * If the pod is being created, record pod worker start latency
 
// * Call generateAPIPodStatus to prepare an v1.PodStatus for the pod
 
// * If the pod is being seen as running for the first time, record pod
 
//   start latency
 
// * Update the status of the pod in the status manager
 
// * Kill the pod if it should not be running
 
// * Create a mirror pod if the pod is a static pod, and does not
 
//   already have a mirror pod
 
// * Create the data directories for the pod if they do not exist
 
// * Wait for volumes to attach/mount
 
// * Fetch the pull secrets for the pod
 
// * Call the container runtime's SyncPod callback
 
// * Update the traffic shaping for the pod's ingress and egress limits
 
//
 
// If any step of this workflow errors, the error is returned, and is repeated
 
// on the next syncPod call.
 
//
 
// This operation writes all events that are dispatched in order to provide
 
// the most accurate information possible about an error situation to aid debugging.
 
// Callers should not throw an event if this operation returns an error.
 
func (kl *Kubelet) syncPod(o syncPodOptions) error {
 
    // pull out the required options
 
    pod := o.pod
 
    mirrorPod := o.mirrorPod
 
    podStatus := o.podStatus
 
    updateType := o.updateType
 
    // if we want to kill a pod, do it now!
 
    if updateType == kubetypes.SyncPodKill {
 
        killPodOptions := o.killPodOptions
 
        if killPodOptions == nil || killPodOptions.PodStatusFunc == nil {
 
            return fmt.Errorf("kill pod options are required if update type is kill")
 
        }
 
        apiPodStatus := killPodOptions.PodStatusFunc(pod, podStatus)
 
        kl.statusManager.SetPodStatus(pod, apiPodStatus)
 
        // we kill the pod with the specified grace period since this is a termination
 
        if err := kl.killPod(pod, nil, podStatus, killPodOptions.PodTerminationGracePeriodSecondsOverride); err != nil {
 
            kl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedToKillPod, "error killing pod: %v", err)
 
            // there was an error killing the pod, so we return that error directly
 
            utilruntime.HandleError(err)
 
            return err
 
        }
 
        return nil
 
    }
 
    // Latency measurements for the main workflow are relative to the
 
    // first time the pod was seen by the API server.
 
    var firstSeenTime time.Time
 
    if firstSeenTimeStr, ok := pod.Annotations[kubetypes.ConfigFirstSeenAnnotationKey]; ok {
 
        firstSeenTime = kubetypes.ConvertToTimestamp(firstSeenTimeStr).Get()
 
    }
 
    // Record pod worker start latency if being created
 
    // TODO: make pod workers record their own latencies
 
    if updateType == kubetypes.SyncPodCreate {
 
        if !firstSeenTime.IsZero() {
 
            // This is the first time we are syncing the pod. Record the latency
 
            // since kubelet first saw the pod if firstSeenTime is set.
 
            metrics.PodWorkerStartDuration.Observe(metrics.SinceInSeconds(firstSeenTime))
 
        } else {
 
            klog.V(3).Infof("First seen time not recorded for pod %q", pod.UID)
 
        }
 
    }
 
 
    // Generate final API pod status with pod and status manager status
 
    apiPodStatus := kl.generateAPIPodStatus(pod, podStatus)
 
    // The pod IP may be changed in generateAPIPodStatus if the pod is using host network. (See #24576)
 
    // TODO(random-liu): After writing pod spec into container labels, check whether pod is using host network, and
 
    // set pod IP to hostIP directly in runtime.GetPodStatus
 
    podStatus.IPs = make([]string, 0, len(apiPodStatus.PodIPs))
 
    for _, ipInfo := range apiPodStatus.PodIPs {
 
        podStatus.IPs = append(podStatus.IPs, ipInfo.IP)
 
    }
 
    if len(podStatus.IPs) == 0 && len(apiPodStatus.PodIP) > 0 {
 
        podStatus.IPs = []string{apiPodStatus.PodIP}
 
    }
 
    // Record the time it takes for the pod to become running.
 
    existingStatus, ok := kl.statusManager.GetPodStatus(pod.UID)
 
    if !ok || existingStatus.Phase == v1.PodPending && apiPodStatus.Phase == v1.PodRunning &&
 
        !firstSeenTime.IsZero() {
 
        metrics.PodStartDuration.Observe(metrics.SinceInSeconds(firstSeenTime))
 
    }
 
    runnable := kl.canRunPod(pod)
 
    if !runnable.Admit {
 
        // Pod is not runnable; update the Pod and Container statuses to why.
 
        apiPodStatus.Reason = runnable.Reason
 
        apiPodStatus.Message = runnable.Message
 
        // Waiting containers are not creating.
 
        const waitingReason = "Blocked"
 
        for _, cs := range apiPodStatus.InitContainerStatuses {
 
            if cs.State.Waiting != nil {
 
                cs.State.Waiting.Reason = waitingReason
 
            }
 
        }
 
        for _, cs := range apiPodStatus.ContainerStatuses {
 
            if cs.State.Waiting != nil {
 
                cs.State.Waiting.Reason = waitingReason
 
            }
 
        }
 
    }
 
 
    // Update status in the status manager
 
    kl.statusManager.SetPodStatus(pod, apiPodStatus)
 
 
    // Kill pod if it should not be running
 
    if !runnable.Admit || pod.DeletionTimestamp != nil || apiPodStatus.Phase == v1.PodFailed {
 
        var syncErr error
 
        if err := kl.killPod(pod, nil, podStatus, nil); err != nil {
 
            kl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedToKillPod, "error killing pod: %v", err)
 
            syncErr = fmt.Errorf("error killing pod: %v", err)
 
            utilruntime.HandleError(syncErr)
 
        } else {
 
            if !runnable.Admit {
 
                // There was no error killing the pod, but the pod cannot be run.
 
                // Return an error to signal that the sync loop should back off.
 
                syncErr = fmt.Errorf("pod cannot be run: %s", runnable.Message)
 
            }
 
        }
 
        return syncErr
 
    }
 
 
 
 
    // If the network plugin is not ready, only start the pod if it uses the host network
 
    if err := kl.runtimeState.networkErrors(); err != nil && !kubecontainer.IsHostNetworkPod(pod) {
 
        kl.recorder.Eventf(pod, v1.EventTypeWarning, events.NetworkNotReady, "%s: %v", NetworkNotReadyErrorMsg, err)
 
        return fmt.Errorf("%s: %v", NetworkNotReadyErrorMsg, err)
 
    }
 
    // Create Cgroups for the pod and apply resource parameters
 
    // to them if cgroups-per-qos flag is enabled.
 
    pcm := kl.containerManager.NewPodContainerManager()
 
    // If pod has already been terminated then we need not create
 
    // or update the pod's cgroup
 
    if !kl.podIsTerminated(pod) {
 
        // When the kubelet is restarted with the cgroups-per-qos
 
        // flag enabled, all the pod's running containers
 
        // should be killed intermittently and brought back up
 
        // under the qos cgroup hierarchy.
 
        // Check if this is the pod's first sync
 
        firstSync := true
 
        for _, containerStatus := range apiPodStatus.ContainerStatuses {
 
            if containerStatus.State.Running != nil {
 
                firstSync = false
 
                break
 
            }
 
        }
 
        // Don't kill containers in pod if pod's cgroups already
 
        // exists or the pod is running for the first time
 
        podKilled := false
 
        if !pcm.Exists(pod) && !firstSync {
 
            if err := kl.killPod(pod, nil, podStatus, nil); err == nil {
 
                podKilled = true
 
            }
 
        }
 
        // Create and Update pod's Cgroups
 
        // Don't create cgroups for run once pod if it was killed above
 
        // The current policy is not to restart the run once pods when
 
        // the kubelet is restarted with the new flag as run once pods are
 
        // expected to run only once and if the kubelet is restarted then
 
        // they are not expected to run again.
 
        // We don't create and apply updates to cgroup if its a run once pod and was killed above
 
        if !(podKilled && pod.Spec.RestartPolicy == v1.RestartPolicyNever) {
 
            if !pcm.Exists(pod) {
 
                if err := kl.containerManager.UpdateQOSCgroups(); err != nil {
 
                    klog.V(2).Infof("Failed to update QoS cgroups while syncing pod: %v", err)
 
                }
 
                if err := pcm.EnsureExists(pod); err != nil {
 
                    kl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedToCreatePodContainer, "unable to ensure pod container exists: %v", err)
 
                    return fmt.Errorf("failed to ensure that the pod: %v cgroups exist and are correctly applied: %v", pod.UID, err)
 
                }
 
            }
 
        }
 
    }
 
 
    // Create Mirror Pod for Static Pod if it doesn't already exist
 
    if kubetypes.IsStaticPod(pod) {
 
        podFullName := kubecontainer.GetPodFullName(pod)
 
        deleted := false
 
        if mirrorPod != nil {
 
            if mirrorPod.DeletionTimestamp != nil || !kl.podManager.IsMirrorPodOf(mirrorPod, pod) {
 
                // The mirror pod is semantically different from the static pod. Remove
 
                // it. The mirror pod will get recreated later.
 
                klog.Infof("Trying to delete pod %s %v", podFullName, mirrorPod.ObjectMeta.UID)
 
                var err error
 
                deleted, err = kl.podManager.DeleteMirrorPod(podFullName, &mirrorPod.ObjectMeta.UID)
 
                if deleted {
 
                    klog.Warningf("Deleted mirror pod %q because it is outdated", format.Pod(mirrorPod))
 
                } else if err != nil {
 
                    klog.Errorf("Failed deleting mirror pod %q: %v", format.Pod(mirrorPod), err)
 
                }
 
            }
 
        }
 
        if mirrorPod == nil || deleted {
 
            node, err := kl.GetNode()
 
            if err != nil || node.DeletionTimestamp != nil {
 
                klog.V(4).Infof("No need to create a mirror pod, since node %q has been removed from the cluster", kl.nodeName)
 
            } else {
 
                klog.V(4).Infof("Creating a mirror pod for static pod %q", format.Pod(pod))
 
                if err := kl.podManager.CreateMirrorPod(pod); err != nil {
 
                    klog.Errorf("Failed creating a mirror pod for %q: %v", format.Pod(pod), err)
 
                }
 
            }
 
        }
 
    }
 
    // Make data directories for the pod
 
    if err := kl.makePodDataDirs(pod); err != nil {
 
        kl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedToMakePodDataDirectories, "error making pod data directories: %v", err)
 
        klog.Errorf("Unable to make pod data directories for pod %q: %v", format.Pod(pod), err)
 
        return err
 
    }
 
    // Volume manager will not mount volumes for terminated pods
 
    if !kl.podIsTerminated(pod) {
 
        // Wait for volumes to attach/mount
 
        if err := kl.volumeManager.WaitForAttachAndMount(pod); err != nil {
 
            kl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedMountVolume, "Unable to attach or mount volumes: %v", err)
 
            klog.Errorf("Unable to attach or mount volumes for pod %q: %v; skipping pod", format.Pod(pod), err)
 
            return err
 
        }
 
    }
 
 
    // Fetch the pull secrets for the pod
 
    pullSecrets := kl.getPullSecretsForPod(pod)
 
 
    // Call the container runtime's SyncPod callback
 
    result := kl.containerRuntime.SyncPod(pod, podStatus, pullSecrets, kl.backOff)
 
    kl.reasonCache.Update(pod.UID, result)
 
    if err := result.Error(); err != nil {
 
        // Do not return error if the only failures were pods in backoff
 
        for _, r := range result.SyncResults {
 
            if r.Error != kubecontainer.ErrCrashLoopBackOff && r.Error != images.ErrImagePullBackOff {
 
                // Do not record an event here, as we keep all event logging for sync pod failures
 
                // local to container runtime so we get better errors
 
                return err
 
            }
 
        }
 
        return nil
 
    }
 
    return nil
 
}

VI. The container runtime stage:

If you have read this far, congratulations: we have finally reached the container-related part. Some details above were left out, such as updating the Pod's status information and generating metrics for the metrics server to scrape; you can dig into that part of the source yourself.

The containerRuntime mentioned above is a kubelet field of type kubecontainer.Runtime, where kubecontainer is the alias for the package "k8s.io/kubernetes/pkg/kubelet/container". Runtime is an interface, so any container runtime that implements its methods satisfies it. The KubeGenericRuntime interface in /pkg/kubelet/kuberuntime/kuberuntime_manager.go embeds kubecontainer.Runtime, and the NewKubeGenericRuntimeManager function in the same file constructs the kubeRuntimeManager that implements the interface.

Note: this part is now highly abstracted. In earlier kubelet versions the logic above was implemented separately by runtime-specific code such as docker_manager.go, but in the current version it has been generalized into kuberuntime_manager.go.

type KubeGenericRuntime interface {
    kubecontainer.Runtime
    kubecontainer.StreamingRuntime
    kubecontainer.ContainerCommandRunner
}

Moving on to the SyncPod method in kuberuntime_manager: this method is also quite long, and through a series of steps it drives the running Pod towards the desired Pod state; for the add-Pod case this means going from nothing to a running Pod. The main steps are:

1. Compute sandbox and container changes. The sandbox is also known as the infra container or pause container; it provides the shared network and filesystem resources for the other containers and is the first container started in the Pod, so it can be thought of as the Pod's infrastructure container.

The Pod status consists of the sandbox container status, the init container statuses, the ephemeral container statuses and the ordinary business container statuses. computePodActions calls podSandboxChanged to decide whether the sandbox must be rebuilt: the sandbox is kept only when there is exactly one sandbox, it has its own IP and its network namespace has not changed; in every other case the sandbox is rebuilt (or created for the first time), and during that rebuild all containers in the Pod are killed (see the sketch after this list).

The change computation for the other containers is more involved and, for reasons of space, is not covered here.

2. Kill the Pod if the sandbox has changed;

3. Kill any containers that should not be running;

4. Create the sandbox container if necessary;

5. Start the ephemeral containers. An ephemeral container is a special kind of container whose start may come before the init containers; it is mainly used for interactive troubleshooting.

6. Start the init containers.

Init containers, as the name suggests, run when the Pod starts, before anything else. If several are defined, they run strictly in the order they are declared: each must succeed before the next starts, and only after all init containers have finished does the main container start. Because the volumes in a Pod are shared, data produced by an init container can be used by the main container.

Typical use cases for init containers are:

    Waiting for another component to become ready: for example, an application with two containerized services, a web server and a database, where the web server needs to reach the database. When the application starts there is no guarantee that the database comes up first, so the web server may fail to connect for a while. To solve this, the web server's Pod can use an init container that checks whether the database is ready; the init container only exits once the database is reachable, after which the web server container starts and makes its real connection to the database.

    Initializing configuration: for example, discovering the existing member nodes of a cluster and preparing the cluster configuration for the main container, so that the main container can join the cluster with that configuration as soon as it starts; this is a common pattern when containerizing software and generating initial cluster configuration files.

    Providing a way to block container startup: the next container only runs after the init container has succeeded, which guarantees that a set of preconditions has been met.

    Other scenarios: registering the Pod with a central database, downloading application dependencies, and so on.

7. Start the remaining ordinary business containers.
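
As referenced in step 1, here is a condensed, self-contained sketch of the sandbox decision. It is an approximation for illustration only; the real logic lives in computePodActions and podSandboxChanged and checks more cases.

// A condensed sketch of the sandbox decision described in step 1: recreate
// the sandbox unless exactly one ready sandbox exists and its network setup
// is still valid. Not the real kubelet code.
package main

import "fmt"

type sandboxStatus struct {
	Ready            bool
	HasIP            bool
	NetworkNamespace string
}

type podStatus struct {
	Sandboxes       []sandboxStatus
	ExpectedNetNS   string
	UsesHostNetwork bool
}

// needsNewSandbox reports whether the pod sandbox must be (re)created.
func needsNewSandbox(s podStatus) bool {
	if len(s.Sandboxes) != 1 {
		return true // none, or more than one sandbox: rebuild
	}
	sb := s.Sandboxes[0]
	if !sb.Ready {
		return true
	}
	if !s.UsesHostNetwork && !sb.HasIP {
		return true // sandbox lost its IP
	}
	if sb.NetworkNamespace != s.ExpectedNetNS {
		return true // network namespace changed
	}
	return false
}

func main() {
	fmt.Println(needsNewSandbox(podStatus{})) // true: no sandbox yet, create one
}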

/pkg/kubelet/kuberuntime/kuberuntime_manager.go

// SyncPod syncs the running pod into the desired pod by executing following steps:
 
//
 
//  1. Compute sandbox and container changes.
 
//  2. Kill pod sandbox if necessary.
 
//  3. Kill any containers that should not be running.
 
//  4. Create sandbox if necessary.
 
//  5. Create ephemeral containers.
 
//  6. Create init containers.
 
//  7. Create normal containers.
 
func (m *kubeGenericRuntimeManager) SyncPod(pod *v1.Pod, podStatus *kubecontainer.PodStatus, pullSecrets []v1.Secret, backOff *flowcontrol.Backoff) (result kubecontainer.PodSyncResult) {
 
    // Step 1: Compute sandbox and container changes.
 
    podContainerChanges := m.computePodActions(pod, podStatus)
 
    klog.V(3).Infof("computePodActions got %+v for pod %q", podContainerChanges, format.Pod(pod))
 
    if podContainerChanges.CreateSandbox {
 
        ref, err := ref.GetReference(legacyscheme.Scheme, pod)
 
        if err != nil {
 
            klog.Errorf("Couldn't make a ref to pod %q: '%v'", format.Pod(pod), err)
 
        }
 
        if podContainerChanges.SandboxID != "" {
 
            m.recorder.Eventf(ref, v1.EventTypeNormal, events.SandboxChanged, "Pod sandbox changed, it will be killed and re-created.")
 
        } else {
 
            klog.V(4).Infof("SyncPod received new pod %q, will create a sandbox for it", format.Pod(pod))
 
        }
 
    }
 
 
 
 
    // Step 2: Kill the pod if the sandbox has changed.
 
    if podContainerChanges.KillPod {
 
        if podContainerChanges.CreateSandbox {
 
            klog.V(4).Infof("Stopping PodSandbox for %q, will start new one", format.Pod(pod))
 
        } else {
 
            klog.V(4).Infof("Stopping PodSandbox for %q because all other containers are dead.", format.Pod(pod))
 
        }
 
 
 
 
        killResult := m.killPodWithSyncResult(pod, kubecontainer.ConvertPodStatusToRunningPod(m.runtimeName, podStatus), nil)
 
        result.AddPodSyncResult(killResult)
 
        if killResult.Error() != nil {
 
            klog.Errorf("killPodWithSyncResult failed: %v", killResult.Error())
 
            return
 
        }
 
 
 
 
        if podContainerChanges.CreateSandbox {
 
            m.purgeInitContainers(pod, podStatus)
 
        }
 
    } else {
 
        // Step 3: kill any running containers in this pod which are not to keep.
 
        for containerID, containerInfo := range podContainerChanges.ContainersToKill {
 
            klog.V(3).Infof("Killing unwanted container %q(id=%q) for pod %q", containerInfo.name, containerID, format.Pod(pod))
 
            killContainerResult := kubecontainer.NewSyncResult(kubecontainer.KillContainer, containerInfo.name)
 
            result.AddSyncResult(killContainerResult)
 
            if err := m.killContainer(pod, containerID, containerInfo.name, containerInfo.message, nil); err != nil {
 
                killContainerResult.Fail(kubecontainer.ErrKillContainer, err.Error())
 
                klog.Errorf("killContainer %q(id=%q) for pod %q failed: %v", containerInfo.name, containerID, format.Pod(pod), err)
 
                return
 
            }
 
        }
 
    }
 
 
 
 
    // Keep terminated init containers fairly aggressively controlled
 
    // This is an optimization because container removals are typically handled
 
    // by container garbage collector.
 
    m.pruneInitContainersBeforeStart(pod, podStatus)
 
 
 
    // We pass the value of the PRIMARY podIP and list of podIPs down to
 
    // generatePodSandboxConfig and generateContainerConfig, which in turn
 
    // passes it to various other functions, in order to facilitate functionality
 
    // that requires this value (hosts file and downward API) and avoid races determining
 
    // the pod IP in cases where a container requires restart but the
 
    // podIP isn't in the status manager yet. The list of podIPs is used to
 
    // generate the hosts file.
 
    //
 
    // We default to the IPs in the passed-in pod status, and overwrite them if the
 
    // sandbox needs to be (re)started.
 
    var podIPs []string
 
    if podStatus != nil {
 
        podIPs = podStatus.IPs
 
    }
 
 
    // Step 4: Create a sandbox for the pod if necessary.
 
    podSandboxID := podContainerChanges.SandboxID
 
    if podContainerChanges.CreateSandbox {
 
        var msg string
 
        var err error
 
        klog.V(4).Infof("Creating sandbox for pod %q", format.Pod(pod))
 
        createSandboxResult := kubecontainer.NewSyncResult(kubecontainer.CreatePodSandbox, format.Pod(pod))
 
        result.AddSyncResult(createSandboxResult)
 
        podSandboxID, msg, err = m.createPodSandbox(pod, podContainerChanges.Attempt)
 
        if err != nil {
 
            createSandboxResult.Fail(kubecontainer.ErrCreatePodSandbox, msg)
 
            klog.Errorf("createPodSandbox for pod %q failed: %v", format.Pod(pod), err)
 
            ref, referr := ref.GetReference(legacyscheme.Scheme, pod)
 
            if referr != nil {
 
                klog.Errorf("Couldn't make a ref to pod %q: '%v'", format.Pod(pod), referr)
 
            }
 
            m.recorder.Eventf(ref, v1.EventTypeWarning, events.FailedCreatePodSandBox, "Failed to create pod sandbox: %v", err)
 
            return
 
        }
 
        klog.V(4).Infof("Created PodSandbox %q for pod %q", podSandboxID, format.Pod(pod))
 
        podSandboxStatus, err := m.runtimeService.PodSandboxStatus(podSandboxID)
 
        if err != nil {
 
            ref, referr := ref.GetReference(legacyscheme.Scheme, pod)
 
            if referr != nil {
 
                klog.Errorf("Couldn't make a ref to pod %q: '%v'", format.Pod(pod), referr)
 
            }
 
            m.recorder.Eventf(ref, v1.EventTypeWarning, events.FailedStatusPodSandBox, "Unable to get pod sandbox status: %v", err)
 
            klog.Errorf("Failed to get pod sandbox status: %v; Skipping pod %q", err, format.Pod(pod))
 
            result.Fail(err)
 
            return
 
        }
 
        // If we ever allow updating a pod from non-host-network to
 
        // host-network, we may use a stale IP.
 
        if !kubecontainer.IsHostNetworkPod(pod) {
 
            // Overwrite the podIPs passed in the pod status, since we just started the pod sandbox.
 
            podIPs = m.determinePodSandboxIPs(pod.Namespace, pod.Name, podSandboxStatus)
 
            klog.V(4).Infof("Determined the ip %v for pod %q after sandbox changed", podIPs, format.Pod(pod))
 
        }
 
    }
 
 
    // the start containers routines depend on pod ip(as in primary pod ip)
 
    // instead of trying to figure out if we have 0 < len(podIPs)
 
    // everytime, we short circuit it here
 
    podIP := ""
 
    if len(podIPs) != 0 {
 
        podIP = podIPs[0]
 
    }
 
    // Get podSandboxConfig for containers to start.
 
    configPodSandboxResult := kubecontainer.NewSyncResult(kubecontainer.ConfigPodSandbox, podSandboxID)
 
    result.AddSyncResult(configPodSandboxResult)
 
    podSandboxConfig, err := m.generatePodSandboxConfig(pod, podContainerChanges.Attempt)
 
    if err != nil {
 
        message := fmt.Sprintf("GeneratePodSandboxConfig for pod %q failed: %v", format.Pod(pod), err)
 
        klog.Error(message)
 
        configPodSandboxResult.Fail(kubecontainer.ErrConfigPodSandbox, message)
 
        return
 
    }
 
 
    // Helper containing boilerplate common to starting all types of containers.
 
    // typeName is a label used to describe this type of container in log messages,
 
    // currently: "container", "init container" or "ephemeral container"
 
    start := func(typeName string, spec *startSpec) error {
 
        startContainerResult := kubecontainer.NewSyncResult(kubecontainer.StartContainer, spec.container.Name)
 
        result.AddSyncResult(startContainerResult)
 
 
        isInBackOff, msg, err := m.doBackOff(pod, spec.container, podStatus, backOff)
 
        if isInBackOff {
 
            startContainerResult.Fail(err, msg)
 
            klog.V(4).Infof("Backing Off restarting %v %+v in pod %v", typeName, spec.container, format.Pod(pod))
 
            return err
 
        }
 
 
        klog.V(4).Infof("Creating %v %+v in pod %v", typeName, spec.container, format.Pod(pod))
 
        // NOTE (aramase) podIPs are populated for single stack and dual stack clusters. Send only podIPs.
 
        if msg, err := m.startContainer(podSandboxID, podSandboxConfig, spec, pod, podStatus, pullSecrets, podIP, podIPs); err != nil {
 
            startContainerResult.Fail(err, msg)
 
            // known errors that are logged in other places are logged at higher levels here to avoid
 
            // repetitive log spam
 
            switch {
 
            case err == images.ErrImagePullBackOff:
 
                klog.V(3).Infof("%v start failed: %v: %s", typeName, err, msg)
 
            default:
 
                utilruntime.HandleError(fmt.Errorf("%v start failed: %v: %s", typeName, err, msg))
 
            }
 
            return err
 
        }
 
        return nil
 
    }
 
    // Step 5: start ephemeral containers
 
    // These are started "prior" to init containers to allow running ephemeral containers even when there
 
    // are errors starting an init container. In practice init containers will start first since ephemeral
 
    // containers cannot be specified on pod creation.
 
    if utilfeature.DefaultFeatureGate.Enabled(features.EphemeralContainers) {
 
        for _, idx := range podContainerChanges.EphemeralContainersToStart {
 
            start("ephemeral container", ephemeralContainerStartSpec(&pod.Spec.EphemeralContainers[idx]))
 
        }
 
    }
 
    // Step 6: start the init container.
 
    if container := podContainerChanges.NextInitContainerToStart; container != nil {
 
        // Start the next init container.
 
        if err := start("init container", containerStartSpec(container)); err != nil {
 
            return
 
        }
 
        // Successfully started the container; clear the entry in the failure
 
        klog.V(4).Infof("Completed init container %q for pod %q", container.Name, format.Pod(pod))
 
    }
 
 
    // Step 7: start containers in podContainerChanges.ContainersToStart.
 
    for _, idx := range podContainerChanges.ContainersToStart {
 
        start("container", containerStartSpec(&pod.Spec.Containers[idx]))
 
    }
 
    return
 
}

VII. Starting a container:

Taking an ordinary business container as the example, Step 7 in the previous section loops over podContainerChanges.ContainersToStart, an integer slice holding the indexes of the containers that need to be started. The start helper mainly calls the startContainer method in kuberuntime_container.go.

startContainer starts a container and returns a message describing the failure if there is one. Its main steps are:

1. Pull the container image;

EnsureImageExists pulls the image for the specified container of the Pod and returns the image reference and any error. The actual pull is performed by the concrete container runtime; in the most common case, Docker, it ends up in the PullImage method of the kubelet's dockershim/docker_image.go.

2. Create the container;

First a *v1.ObjectReference for the container is generated, holding the container's metadata. The container's restart count is computed, with a brand new container starting at 0. The container configuration is then generated and runtimeService.CreateContainer is called, followed by the pre-start preparation work (the internal PreStartContainer hook).

3. Start the container;

runtimeService.StartContainer is called to start the container.

4. Run the post-start hook to finish up.

If the container defines Lifecycle.PostStart, that post-start hook is executed as the final step.

/pkg/kubelet/kuberuntime/kuberuntime_container.go

// startContainer starts a container and returns a message indicates why it is failed on error.
 
// It starts the container through the following steps:
 
// * pull the image
 
// * create the container
 
// * start the container
 
// * run the post start lifecycle hooks (if applicable)
 
func (m *kubeGenericRuntimeManager) startContainer(podSandboxID string, podSandboxConfig *runtimeapi.PodSandboxConfig, spec *startSpec, pod *v1.Pod, podStatus *kubecontainer.PodStatus, pullSecrets []v1.Secret, podIP string, podIPs []string) (string, error) {
 
    container := spec.container
 
 
    // Step 1: pull the image.
 
    imageRef, msg, err := m.imagePuller.EnsureImageExists(pod, container, pullSecrets, podSandboxConfig)
 
    if err != nil {
 
        s, _ := grpcstatus.FromError(err)
 
        m.recordContainerEvent(pod, container, "", v1.EventTypeWarning, events.FailedToCreateContainer, "Error: %v", s.Message())
 
        return msg, err
 
    }
 
 
    // Step 2: create the container.
 
    ref, err := kubecontainer.GenerateContainerRef(pod, container)
 
    if err != nil {
 
        klog.Errorf("Can't make a ref to pod %q, container %v: %v", format.Pod(pod), container.Name, err)
 
    }
 
    klog.V(4).Infof("Generating ref for container %s: %#v", container.Name, ref)
 
 
 
    // For a new container, the RestartCount should be 0
 
    restartCount := 0
 
    containerStatus := podStatus.FindContainerStatusByName(container.Name)
 
    if containerStatus != nil {
 
        restartCount = containerStatus.RestartCount + 1
 
    }
 
 
    target, err := spec.getTargetID(podStatus)
 
    if err != nil {
 
        s, _ := grpcstatus.FromError(err)
 
        m.recordContainerEvent(pod, container, "", v1.EventTypeWarning, events.FailedToCreateContainer, "Error: %v", s.Message())
 
        return s.Message(), ErrCreateContainerConfig
 
    }
 
 
    containerConfig, cleanupAction, err := m.generateContainerConfig(container, pod, restartCount, podIP, imageRef, podIPs, target)
 
    if cleanupAction != nil {
 
        defer cleanupAction()
 
    }
 
    if err != nil {
 
        s, _ := grpcstatus.FromError(err)
 
        m.recordContainerEvent(pod, container, "", v1.EventTypeWarning, events.FailedToCreateContainer, "Error: %v", s.Message())
 
        return s.Message(), ErrCreateContainerConfig
 
    }
 
 
    containerID, err := m.runtimeService.CreateContainer(podSandboxID, containerConfig, podSandboxConfig)
 
    if err != nil {
 
        s, _ := grpcstatus.FromError(err)
 
        m.recordContainerEvent(pod, container, containerID, v1.EventTypeWarning, events.FailedToCreateContainer, "Error: %v", s.Message())
 
        return s.Message(), ErrCreateContainer
 
    }
 
    err = m.internalLifecycle.PreStartContainer(pod, container, containerID)
 
    if err != nil {
 
        s, _ := grpcstatus.FromError(err)
 
        m.recordContainerEvent(pod, container, containerID, v1.EventTypeWarning, events.FailedToStartContainer, "Internal PreStartContainer hook failed: %v", s.Message())
 
        return s.Message(), ErrPreStartHook
 
    }
 
    m.recordContainerEvent(pod, container, containerID, v1.EventTypeNormal, events.CreatedContainer, fmt.Sprintf("Created container %s", container.Name))
 
 
    if ref != nil {
 
        m.containerRefManager.SetRef(kubecontainer.ContainerID{
 
            Type: m.runtimeName,
 
            ID:   containerID,
 
        }, ref)
 
    }
 
 
    // Step 3: start the container.
 
    err = m.runtimeService.StartContainer(containerID)
 
    if err != nil {
 
        s, _ := grpcstatus.FromError(err)
 
        m.recordContainerEvent(pod, container, containerID, v1.EventTypeWarning, events.FailedToStartContainer, "Error: %v", s.Message())
 
        return s.Message(), kubecontainer.ErrRunContainer
 
    }
 
    m.recordContainerEvent(pod, container, containerID, v1.EventTypeNormal, events.StartedContainer, fmt.Sprintf("Started container %s", container.Name))
 
 
    // Symlink container logs to the legacy container log location for cluster logging
 
    // support.
 
    // TODO(random-liu): Remove this after cluster logging supports CRI container log path.
 
    containerMeta := containerConfig.GetMetadata()
 
    sandboxMeta := podSandboxConfig.GetMetadata()
 
    legacySymlink := legacyLogSymlink(containerID, containerMeta.Name, sandboxMeta.Name,
 
        sandboxMeta.Namespace)
 
    containerLog := filepath.Join(podSandboxConfig.LogDirectory, containerConfig.LogPath)
 
    // only create legacy symlink if containerLog path exists (or the error is not IsNotExist).
 
    // Because if containerLog path does not exist, only dangling legacySymlink is created.
 
    // This dangling legacySymlink is later removed by container gc, so it does not make sense
 
    // to create it in the first place. it happens when journald logging driver is used with docker.
 
    if _, err := m.osInterface.Stat(containerLog); !os.IsNotExist(err) {
 
        if err := m.osInterface.Symlink(containerLog, legacySymlink); err != nil {
 
            klog.Errorf("Failed to create legacy symbolic link %q to container %q log %q: %v",
 
                legacySymlink, containerID, containerLog, err)
 
        }
 
    }
 
 
    // Step 4: execute the post start hook.
 
    if container.Lifecycle != nil && container.Lifecycle.PostStart != nil {
 
        kubeContainerID := kubecontainer.ContainerID{
 
            Type: m.runtimeName,
 
            ID:   containerID,
 
        }
 
        msg, handlerErr := m.runner.Run(kubeContainerID, pod, container, container.Lifecycle.PostStart)
 
        if handlerErr != nil {
 
            m.recordContainerEvent(pod, container, kubeContainerID.ID, v1.EventTypeWarning, events.FailedPostStartHook, msg)
 
            if err := m.killContainer(pod, kubeContainerID, container.Name, "FailedPostStartHook", nil); err != nil {
 
                klog.Errorf("Failed to kill container %q(id=%q) in pod %q: %v, %v",
 
                    container.Name, kubeContainerID.String(), format.Pod(pod), ErrPostStartHook, err)
 
            }
 
            return msg, fmt.Errorf("%s: %v", ErrPostStartHook, handlerErr)
 
        }
 
    }
 
 
    return "", nil
 
}