Kubelet ImageGC: How It Works

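What is Image GC?

Image GC is the kubelet's image cleanup feature. When disk space runs low, it removes images that are no longer needed to free disk space, so that Pods can start and run normally.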

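How do you use Image GC?

Image GC is enabled by default in the kubelet and is controlled by ImageGCPolicy in the kubelet startup configuration. ImageGCPolicy has three parameters: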

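  • ImageGCHighThresholdPercent: the threshold that triggers GC. GC runs once disk usage reaches this value; when it is set to 100, GC is not started.
  • ImageGCLowThresholdPercent: the target of a GC run. Once triggered, GC tries to bring disk usage down to this value.
  • ImageMinimumGCAge: the minimum age for GC, measured from the time the image was first detected. Images younger than this threshold are not collected.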
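
To make the three parameters concrete, here is a minimal Go sketch. The gcPolicy type and its methods are hypothetical names used only for illustration; they are not the kubelet's API. The values 85, 80 and two minutes are, to my knowledge, the documented kubelet defaults, but check them against the version you run.

```go
package main

import (
	"fmt"
	"time"
)

// gcPolicy is a hypothetical stand-in for the three settings described above.
type gcPolicy struct {
	HighThresholdPercent int
	LowThresholdPercent  int
	MinAge               time.Duration
}

// periodicGCEnabled reports whether the periodic image GC would run at all.
// A high threshold of 100 switches it off.
func (p gcPolicy) periodicGCEnabled() bool {
	return p.HighThresholdPercent != 100
}

// shouldCollect reports whether the given disk usage percentage triggers a GC pass.
func (p gcPolicy) shouldCollect(usagePercent int) bool {
	return p.periodicGCEnabled() && usagePercent >= p.HighThresholdPercent
}

func main() {
	p := gcPolicy{HighThresholdPercent: 85, LowThresholdPercent: 80, MinAge: 2 * time.Minute}
	fmt.Println(p.shouldCollect(84), p.shouldCollect(85), p.shouldCollect(90))
}
```

Note that the comparison is "greater than or equal", matching the check in GarbageCollect shown later in this article.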

Source code analysis

Initializing and starting ImageGC

When the kubelet starts, ImageGC is started after BirthCry has finished.

func (kl *Kubelet) StartGarbageCollection() {
	loggedContainerGCFailure := false
    
	// container GC flow, omitted
	...

	// Image GC is disabled when ImageGCHighThresholdPercent is set to 100
	if kl.kubeletConfiguration.ImageGCHighThresholdPercent == 100 {
		klog.V(2).Infof("ImageGCHighThresholdPercent is set 100, Disable image GC")
		return
	}

	prevImageGCFailed := false
	go wait.Until(func() {
		if err := kl.imageManager.GarbageCollect(); err != nil {
			if prevImageGCFailed {
				klog.Errorf("Image garbage collection failed multiple times in a row: %v", err)
				// Only create an event for repeated failures
				kl.recorder.Eventf(kl.nodeRef, v1.EventTypeWarning, events.ImageGCFailed, err.Error())
			} else {
				klog.Errorf("Image garbage collection failed once. Stats initialization may not have completed yet: %v", err)
			}
			prevImageGCFailed = true
		} else {
			var vLevel klog.Level = 4
			if prevImageGCFailed {
				vLevel = 1
				prevImageGCFailed = false
			}

			klog.V(vLevel).Infof("Image garbage collection succeeded")
		}
	}, ImageGCPeriod, wait.NeverStop)
}

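As the code shows, ImageGC runs in its own goroutine, with a default interval of five minutes. The first time ImageGC fails, only a log entry is written; on repeated failures an ImageGCFailed event is recorded as well. This means you can tell whether GC is running normally by setting up log monitoring or alerts.

Next, let's look at how ImageGCManager is implemented.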
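
The failure handling above is a small state machine around prevImageGCFailed. The sketch below replays it on a sequence of outcomes; gcReporter, observe and the returned strings are invented for illustration and only mirror the branching in StartGarbageCollection.

```go
package main

import (
	"errors"
	"fmt"
)

// gcReporter mirrors the prevImageGCFailed bookkeeping in StartGarbageCollection.
// The type and the returned strings are illustrative only.
type gcReporter struct {
	prevFailed bool
}

// observe takes the result of one GC round and returns what would be emitted.
func (r *gcReporter) observe(err error) string {
	if err != nil {
		action := "log error"
		if r.prevFailed {
			// Only repeated failures produce an event.
			action = "log error + ImageGCFailed event"
		}
		r.prevFailed = true
		return action
	}
	action := "log success at V(4)"
	if r.prevFailed {
		// Recovery after a failure is logged at a more visible level.
		action = "log success at V(1)"
		r.prevFailed = false
	}
	return action
}

// simulate feeds a sequence of outcomes (true = the round failed) through the reporter.
func simulate(failures []bool) []string {
	r := &gcReporter{}
	out := make([]string, 0, len(failures))
	for _, failed := range failures {
		var err error
		if failed {
			err = errors.New("gc failed")
		}
		out = append(out, r.observe(err))
	}
	return out
}

func main() {
	for i, action := range simulate([]bool{true, true, false, false}) {
		fmt.Printf("round %d: %s\n", i+1, action)
	}
}
```

One practical consequence: an alert based on the ImageGCFailed event only fires from the second consecutive failure onward, so a single failed round right after startup does not raise it.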

type ImageGCManager interface {
	// Applies the garbage collection policy. Errors include being unable to free
	// enough space as per the garbage collection policy.
	GarbageCollect() error

	// Start async garbage collection of images.
	Start()

	GetImageList() ([]container.Image, error)

	// Delete all unused images.
	DeleteUnusedImages() error
}

func NewImageGCManager(runtime container.Runtime, statsProvider StatsProvider, recorder record.EventRecorder, nodeRef *v1.ObjectReference, policy ImageGCPolicy, sandboxImage string) (ImageGCManager, error) {
	// Validate policy.
	if policy.HighThresholdPercent < 0 || policy.HighThresholdPercent > 100 {
		return nil, fmt.Errorf("invalid HighThresholdPercent %d, must be in range [0-100]", policy.HighThresholdPercent)
	}
	if policy.LowThresholdPercent < 0 || policy.LowThresholdPercent > 100 {
		return nil, fmt.Errorf("invalid LowThresholdPercent %d, must be in range [0-100]", policy.LowThresholdPercent)
	}
	if policy.LowThresholdPercent > policy.HighThresholdPercent {
		return nil, fmt.Errorf("LowThresholdPercent %d can not be higher than HighThresholdPercent %d", policy.LowThresholdPercent, policy.HighThresholdPercent)
	}
	im := &realImageGCManager{
		runtime:       runtime,
		policy:        policy,
		imageRecords:  make(map[string]*imageRecord),
		statsProvider: statsProvider,
		recorder:      recorder,
		nodeRef:       nodeRef,
		initialized:   false,
		sandboxImage:  sandboxImage,
	}

	return im, nil
}

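The ImageGCManager interface is very simple, with only four methods:

  • GarbageCollect: performs the actual cleanup according to the configured ImageGCPolicy;
  • Start: collects image information asynchronously;
  • GetImageList: returns the cached image list;
  • DeleteUnusedImages: deletes unused images.

When it is initialized, ImageGCManager validates the Policy parameters and then stores the runtime, stats provider, event recorder and the other arguments. Now let's look at the logic of the Start method: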

func (im *realImageGCManager) Start() {
	go wait.Until(func() {
		// Initial detection make detected time "unknown" in the past.
		var ts time.Time
		if im.initialized {
			ts = time.Now()
		}
		_, err := im.detectImages(ts)
		if err != nil {
			klog.Warningf("[imageGCManager] Failed to monitor images: %v", err)
		} else {
			im.initialized = true
		}
	}, 5*time.Minute, wait.NeverStop)

	// Start a goroutine periodically updates image cache.
	go wait.Until(func() {
		images, err := im.runtime.ListImages()
		if err != nil {
			klog.Warningf("[imageGCManager] Failed to update image list: %v", err)
		} else {
			im.imageCache.set(images)
		}
	}, 30*time.Second, wait.NeverStop)

}

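The Start method of ImageGCManager launches two goroutines. In the first, the Manager runs image detection every five minutes; once a detection pass succeeds, the Manager is marked as initialized. The other goroutine fetches all image information from the container runtime every 30 seconds and updates the cached image list.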
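
One detail in Start is easy to miss: before the Manager is initialized, detectImages receives the zero value of time.Time, so images that were already on disk get a first-detected time far in the past. The sketch below shows the effect on the minimum-age check that freeSpace applies later. The function oldEnough and the concrete timestamps are made up for the example.

```go
package main

import (
	"fmt"
	"time"
)

// oldEnough mirrors the age check in freeSpace: an image may be collected only
// if it was first detected at least minAge before freeTime.
func oldEnough(firstDetected, freeTime time.Time, minAge time.Duration) bool {
	return freeTime.Sub(firstDetected) >= minAge
}

// demo compares an image recorded during the initial pass (zero time.Time)
// with one recorded 30 seconds before the GC run.
func demo() (initialPass, recentPull bool) {
	freeTime := time.Date(2021, 4, 25, 12, 0, 0, 0, time.UTC)
	minAge := 2 * time.Minute

	var unknown time.Time // zero value, what Start passes before initialization
	recent := freeTime.Add(-30 * time.Second)

	return oldEnough(unknown, freeTime, minAge), oldEnough(recent, freeTime, minAge)
}

func main() {
	initialPass, recentPull := demo()
	fmt.Println("image from the initial pass is old enough:", initialPass)
	fmt.Println("image pulled 30s ago is old enough:", recentPull)
}
```

In other words, images that predate the kubelet process are never shielded by ImageMinimumGCAge, while images detected in later passes are.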

Detecting and maintaining image information

So how does the Manager detect images?

func (im *realImageGCManager) detectImages(detectTime time.Time) (sets.String, error) {
	imagesInUse := sets.NewString()

	// Always consider the container runtime pod sandbox image in use
	imageRef, err := im.runtime.GetImageRef(container.ImageSpec{Image: im.sandboxImage})
	if err == nil && imageRef != "" {
		imagesInUse.Insert(imageRef)
	}

	images, err := im.runtime.ListImages()
	if err != nil {
		return imagesInUse, err
	}
	pods, err := im.runtime.GetPods(true)
	if err != nil {
		return imagesInUse, err
	}

	// Make a set of images in use by containers.
	for _, pod := range pods {
		for _, container := range pod.Containers {
			klog.V(5).Infof("Pod %s/%s, container %s uses image %s(%s)", pod.Namespace, pod.Name, container.Name, container.Image, container.ImageID)
			imagesInUse.Insert(container.ImageID)
		}
	}

	// Add new images and record those being used.
	now := time.Now()
	currentImages := sets.NewString()
	im.imageRecordsLock.Lock()
	defer im.imageRecordsLock.Unlock()
	for _, image := range images {
		klog.V(5).Infof("Adding image ID %s to currentImages", image.ID)
		currentImages.Insert(image.ID)

		// New image, set it as detected now.
		if _, ok := im.imageRecords[image.ID]; !ok {
			klog.V(5).Infof("Image ID %s is new", image.ID)
			im.imageRecords[image.ID] = &imageRecord{
				firstDetected: detectTime,
			}
		}

		// Set last used time to now if the image is being used.
		if isImageUsed(image.ID, imagesInUse) {
			klog.V(5).Infof("Setting Image ID %s lastUsed to %v", image.ID, now)
			im.imageRecords[image.ID].lastUsed = now
		}

		klog.V(5).Infof("Image ID %s has size %d", image.ID, image.Size)
		im.imageRecords[image.ID].size = image.Size
	}

	// Remove old images from our records.
	for image := range im.imageRecords {
		if !currentImages.Has(image) {
			klog.V(5).Infof("Image ID %s is no longer present; removing from imageRecords", image)
			delete(im.imageRecords, image)
		}
	}

	return imagesInUse, nil
}

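The purpose of image detection is to find the images that are in use, so that they are not removed while GC runs. The bookkeeping that cleanup relies on is also updated during detection.

First, the sandbox image is always treated as in use. Next, the image of every container in every Pod returned by the runtime is added to the in-use set. Note that GetPods is called with true, which asks the runtime for all containers, including exited ones, so the set is not limited to containers in the Running state. What is not protected is an image that a Pod needs but for which the runtime has no container yet; such an image can still be cleaned up.

Once the in-use images (those that will not be cleaned up) have been identified, the image list obtained from the container runtime is merged into the image records maintained by the Manager. Each image in the freshly fetched list is processed as follows:

  • If the image is being recorded for the first time, its first-detected time is set to the time of this detection pass.
  • If the previous step identified it as an in-use image, its last-used time is refreshed to the current time.
  • The recorded size of the image is refreshed.

Finally, any image that is no longer in the list returned by the container runtime is removed from the detection records cached by the Manager.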
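
The bookkeeping described above can be replayed on plain maps. This is a simplified sketch, not the kubelet's code: record stands in for imageRecord, the in-use set is passed in directly instead of being derived from Pods, and the image names and sizes are invented.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// record is a simplified stand-in for the kubelet's imageRecord.
type record struct {
	firstDetected time.Time
	lastUsed      time.Time
	size          int64
}

// detect applies one detection pass: add new images, refresh lastUsed for
// images in use, update sizes, and drop records for images that are gone.
func detect(records map[string]*record, present map[string]int64, inUse map[string]bool, detectTime, now time.Time) {
	for id, size := range present {
		if _, ok := records[id]; !ok {
			records[id] = &record{firstDetected: detectTime}
		}
		if inUse[id] {
			records[id].lastUsed = now
		}
		records[id].size = size
	}
	for id := range records {
		if _, ok := present[id]; !ok {
			delete(records, id)
		}
	}
}

// demo runs two passes and returns the surviving image IDs in sorted order.
func demo() []string {
	t0 := time.Date(2021, 4, 25, 12, 0, 0, 0, time.UTC)
	records := map[string]*record{}

	// Pass 1: two images on disk, only "nginx" is used by a container.
	detect(records, map[string]int64{"nginx": 100, "busybox": 5}, map[string]bool{"nginx": true}, t0, t0)

	// Pass 2: "busybox" has disappeared from the runtime, "redis" was pulled.
	t1 := t0.Add(5 * time.Minute)
	detect(records, map[string]int64{"nginx": 100, "redis": 40}, map[string]bool{"nginx": true}, t1, t1)

	ids := make([]string, 0, len(records))
	for id := range records {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

func main() {
	fmt.Println(demo())
}
```

After the second pass the record for "busybox" is gone, "redis" has a first-detected time of the second pass and a zero last-used time, and "nginx" keeps its original first-detected time with a refreshed last-used time.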

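How ImageGC is executed

The core method of ImageGCManager is GarbageCollect. Its main steps are: first obtain the usage information of the filesystem that holds the images, then compute the usage percentage and, from the configured thresholds, the amount of space to free, and then start freeing. If the space actually freed is less than the target, a Warning event with reason FreeDiskSpaceFailed is recorded.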

func (im *realImageGCManager) GarbageCollect() error {
	// Get disk usage on disk holding images.
	fsStats, err := im.statsProvider.ImageFsStats()
	if err != nil {
		return err
	}

	var capacity, available int64
	if fsStats.CapacityBytes != nil {
		capacity = int64(*fsStats.CapacityBytes)
	}
	if fsStats.AvailableBytes != nil {
		available = int64(*fsStats.AvailableBytes)
	}

	if available > capacity {
		klog.Warningf("available %d is larger than capacity %d", available, capacity)
		available = capacity
	}

	// Check valid capacity.
	if capacity == 0 {
		err := goerrors.New("invalid capacity 0 on image filesystem")
		im.recorder.Eventf(im.nodeRef, v1.EventTypeWarning, events.InvalidDiskCapacity, err.Error())
		return err
	}

	// If over the max threshold, free enough to place us at the lower threshold.
	usagePercent := 100 - int(available*100/capacity)
	if usagePercent >= im.policy.HighThresholdPercent {
		amountToFree := capacity*int64(100-im.policy.LowThresholdPercent)/100 - available
		klog.Infof("[imageGCManager]: Disk usage on image filesystem is at %d%% which is over the high threshold (%d%%). Trying to free %d bytes down to the low threshold (%d%%).", usagePercent, im.policy.HighThresholdPercent, amountToFree, im.policy.LowThresholdPercent)
		freed, err := im.freeSpace(amountToFree, time.Now())
		if err != nil {
			return err
		}

		if freed < amountToFree {
			err := fmt.Errorf("failed to garbage collect required amount of images. Wanted to free %d bytes, but freed %d bytes", amountToFree, freed)
			im.recorder.Eventf(im.nodeRef, v1.EventTypeWarning, events.FreeDiskSpaceFailed, err.Error())
			return err
		}
	}

	return nil
}
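
As a worked example of the arithmetic in GarbageCollect, the sketch below applies the same two formulas to a 100 GiB image filesystem with 10 GiB available and thresholds of 85 and 80. The function name plan and the numbers are chosen for illustration only.

```go
package main

import "fmt"

// plan reproduces the arithmetic in GarbageCollect for illustration.
// It returns the usage percentage and the number of bytes to free
// (0 when usage is below the high threshold).
func plan(capacity, available int64, high, low int) (usagePercent int, amountToFree int64) {
	usagePercent = 100 - int(available*100/capacity)
	if usagePercent < high {
		return usagePercent, 0
	}
	amountToFree = capacity*int64(100-low)/100 - available
	return usagePercent, amountToFree
}

func main() {
	const gib = int64(1) << 30
	// 100 GiB image filesystem with 10 GiB available, thresholds 85/80.
	usage, toFree := plan(100*gib, 10*gib, 85, 80)
	fmt.Printf("usage=%d%% toFree=%d GiB\n", usage, toFree/gib)
}
```

Usage is 90%, which is over the high threshold, so the target is to get back to 20 GiB available, which means freeing 10 GiB. Because the available percentage is truncated by integer division, the computed usage is rounded up: 14.9% available counts as 86% used.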

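Once the amount of space to free has been computed, how are the images to delete chosen? At the start of cleanup, the image detection process described above is run. When it completes, the resulting imagesInUse contains the sandbox image and the images used by Pod containers. Next, the cleanup targets are selected. They are stored as evictionInfo entries, one for every image record that is not in imagesInUse. These records are then sorted by last-used time and then by first-detected time; in other words, following an LRU rule, images that were last used earlier, and among those the ones detected earlier, come first.

After sorting, the images are visited in order. If an image's last-used time is not earlier than the time the cleanup was triggered (that is, its last-used time has just been refreshed), it is not deleted. Likewise, if the time since the image was first detected is less than the configured minimum GC age (that is, it has only just been added to the records), it is not deleted. Otherwise the images are deleted in order; after each deletion the image is removed from the detection records and its size is added to the freed-space counter spaceFreed. Once spaceFreed is no less than the target, the cleanup round ends normally.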

func (im *realImageGCManager) freeSpace(bytesToFree int64, freeTime time.Time) (int64, error) {
	imagesInUse, err := im.detectImages(freeTime)
	if err != nil {
		return 0, err
	}

	im.imageRecordsLock.Lock()
	defer im.imageRecordsLock.Unlock()

	// Get all images in eviction order.
	images := make([]evictionInfo, 0, len(im.imageRecords))
	for image, record := range im.imageRecords {
		if isImageUsed(image, imagesInUse) {
			klog.V(5).Infof("Image ID %s is being used", image)
			continue
		}
		images = append(images, evictionInfo{
			id:          image,
			imageRecord: *record,
		})
	}
	sort.Sort(byLastUsedAndDetected(images))

	// Delete unused images until we've freed up enough space.
	var deletionErrors []error
	spaceFreed := int64(0)
	for _, image := range images {
		klog.V(5).Infof("Evaluating image ID %s for possible garbage collection", image.id)
		// Images that are currently in used were given a newer lastUsed.
		if image.lastUsed.Equal(freeTime) || image.lastUsed.After(freeTime) {
			klog.V(5).Infof("Image ID %s has lastUsed=%v which is >= freeTime=%v, not eligible for garbage collection", image.id, image.lastUsed, freeTime)
			continue
		}

		// Avoid garbage collect the image if the image is not old enough.
		// In such a case, the image may have just been pulled down, and will be used by a container right away.

		if freeTime.Sub(image.firstDetected) < im.policy.MinAge {
			klog.V(5).Infof("Image ID %s has age %v which is less than the policy's minAge of %v, not eligible for garbage collection", image.id, freeTime.Sub(image.firstDetected), im.policy.MinAge)
			continue
		}

		// Remove image. Continue despite errors.
		klog.Infof("[imageGCManager]: Removing image %q to free %d bytes", image.id, image.size)
		err := im.runtime.RemoveImage(container.ImageSpec{Image: image.id})
		if err != nil {
			deletionErrors = append(deletionErrors, err)
			continue
		}
		delete(im.imageRecords, image.id)
		spaceFreed += image.size

		if spaceFreed >= bytesToFree {
			break
		}
	}

	if len(deletionErrors) > 0 {
		return spaceFreed, fmt.Errorf("wanted to free %d bytes, but freed %d bytes space with errors in image deletion: %v", bytesToFree, spaceFreed, errors.NewAggregate(deletionErrors))
	}
	return spaceFreed, nil
}
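
The selection logic in freeSpace can be tried out on a handful of made-up candidates. In this sketch the comparator is my assumption of what byLastUsedAndDetected does, based on its name and the description above (last-used time first, first-detected time as tie-breaker); the candidate type, the image names and the sizes are hypothetical, and deletion is replaced by collecting IDs.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// candidate is a simplified stand-in for evictionInfo.
type candidate struct {
	id            string
	lastUsed      time.Time
	firstDetected time.Time
	size          int64
}

// evictionOrder sorts by lastUsed (oldest first), ties broken by firstDetected.
func evictionOrder(c []candidate) {
	sort.Slice(c, func(i, j int) bool {
		if c[i].lastUsed.Equal(c[j].lastUsed) {
			return c[i].firstDetected.Before(c[j].firstDetected)
		}
		return c[i].lastUsed.Before(c[j].lastUsed)
	})
}

// pick returns the IDs that would be removed, stopping once bytesToFree is reached.
func pick(c []candidate, bytesToFree int64, freeTime time.Time, minAge time.Duration) []string {
	evictionOrder(c)
	var removed []string
	var freed int64
	for _, img := range c {
		if !img.lastUsed.Before(freeTime) {
			continue // lastUsed was just refreshed, so the image is in use
		}
		if freeTime.Sub(img.firstDetected) < minAge {
			continue // younger than the minimum GC age
		}
		removed = append(removed, img.id)
		freed += img.size
		if freed >= bytesToFree {
			break
		}
	}
	return removed
}

// demo builds five unused images and asks for 60 bytes to be freed.
func demo() []string {
	now := time.Date(2021, 4, 25, 12, 0, 0, 0, time.UTC)
	ago := func(d time.Duration) time.Time { return now.Add(-d) }
	images := []candidate{
		{id: "fresh", firstDetected: ago(time.Minute), size: 50},
		{id: "used-yesterday", lastUsed: ago(24 * time.Hour), firstDetected: ago(48 * time.Hour), size: 30},
		{id: "never-used-new", firstDetected: ago(3 * time.Hour), size: 20},
		{id: "never-used-old", firstDetected: ago(72 * time.Hour), size: 20},
		{id: "used-30m-ago", lastUsed: ago(30 * time.Minute), firstDetected: ago(72 * time.Hour), size: 100},
	}
	return pick(images, 60, now, 2*time.Minute)
}

func main() {
	fmt.Println(demo())
}
```

Images that were never used have a zero last-used time and therefore sort first, ordered by when they were detected. "fresh" is skipped because it is only one minute old, and the loop stops after "used-yesterday" because 70 bytes have been freed against a target of 60.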

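Disk eviction and ImageGC

Besides the GC logic above, there is one more condition that triggers ImageGC. In practice we occasionally saw images being cleaned up even though ImageGCHighThresholdPercent was set to 100. Looking back at the ImageGC interface mentioned earlier, we can see that DeleteUnusedImages is a public method.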

func buildSignalToNodeReclaimFuncs(imageGC ImageGC, containerGC ContainerGC, withImageFs bool) map[evictionapi.Signal]nodeReclaimFuncs {
	signalToReclaimFunc := map[evictionapi.Signal]nodeReclaimFuncs{}
	// usage of an imagefs is optional
	if withImageFs {
		// with an imagefs, nodefs pressure should just delete logs
		signalToReclaimFunc[evictionapi.SignalNodeFsAvailable] = nodeReclaimFuncs{}
		signalToReclaimFunc[evictionapi.SignalNodeFsInodesFree] = nodeReclaimFuncs{}
		// with an imagefs, imagefs pressure should delete unused images
		signalToReclaimFunc[evictionapi.SignalImageFsAvailable] = nodeReclaimFuncs{containerGC.DeleteAllUnusedContainers, imageGC.DeleteUnusedImages}
		signalToReclaimFunc[evictionapi.SignalImageFsInodesFree] = nodeReclaimFuncs{containerGC.DeleteAllUnusedContainers, imageGC.DeleteUnusedImages}
	} else {
		// without an imagefs, nodefs pressure should delete logs, and unused images
		// since imagefs and nodefs share a common device, they share common reclaim functions
		signalToReclaimFunc[evictionapi.SignalNodeFsAvailable] = nodeReclaimFuncs{containerGC.DeleteAllUnusedContainers, imageGC.DeleteUnusedImages}
		signalToReclaimFunc[evictionapi.SignalNodeFsInodesFree] = nodeReclaimFuncs{containerGC.DeleteAllUnusedContainers, imageGC.DeleteUnusedImages}
		signalToReclaimFunc[evictionapi.SignalImageFsAvailable] = nodeReclaimFuncs{containerGC.DeleteAllUnusedContainers, imageGC.DeleteUnusedImages}
		signalToReclaimFunc[evictionapi.SignalImageFsInodesFree] = nodeReclaimFuncs{containerGC.DeleteAllUnusedContainers, imageGC.DeleteUnusedImages}
	}
	return signalToReclaimFunc
}

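In fact, when a full disk triggers a node eviction signal, the container and image GC methods are called directly; after all, a node eviction signal is the more urgent situation.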
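
The mapping built above can be modelled with plain strings to see which signals end up deleting images. This is a toy version: reclaimFunc, buildReclaimFuncs and reclaim are hypothetical simplifications, the signal strings are the names I believe the evictionapi constants stand for, and how the eviction manager actually iterates and handles errors is not shown in this article. The point is only that nothing on this path consults ImageGCHighThresholdPercent.

```go
package main

import "fmt"

// reclaimFunc is a simplified stand-in for the node-level reclaim functions.
type reclaimFunc func() error

// buildReclaimFuncs is a toy version of buildSignalToNodeReclaimFuncs that
// uses plain strings for the signals.
func buildReclaimFuncs(deleteContainers, deleteImages reclaimFunc, withImageFs bool) map[string][]reclaimFunc {
	both := []reclaimFunc{deleteContainers, deleteImages}
	funcs := map[string][]reclaimFunc{
		"imagefs.available":  both,
		"imagefs.inodesFree": both,
	}
	if withImageFs {
		// With a dedicated imagefs, nodefs pressure does not touch images.
		funcs["nodefs.available"] = nil
		funcs["nodefs.inodesFree"] = nil
	} else {
		// Shared device: nodefs pressure reclaims containers and images too.
		funcs["nodefs.available"] = both
		funcs["nodefs.inodesFree"] = both
	}
	return funcs
}

// reclaim runs every function registered for the signal and returns the
// names of the steps that were invoked, in order.
func reclaim(signal string, withImageFs bool) []string {
	var calls []string
	deleteContainers := func() error { calls = append(calls, "containers"); return nil }
	deleteImages := func() error { calls = append(calls, "images"); return nil }
	for _, f := range buildReclaimFuncs(deleteContainers, deleteImages, withImageFs)[signal] {
		_ = f() // this sketch ignores errors and keeps going
	}
	return calls
}

func main() {
	fmt.Println(reclaim("nodefs.available", false))
	fmt.Println(reclaim("nodefs.available", true))
	fmt.Println(reclaim("imagefs.available", true))
}
```

On a node without a separate image filesystem, any nodefs or imagefs signal leads to image deletion, which matches the observation that images can disappear even when the periodic GC is switched off.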

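Summary

Overall, the kubelet deletes redundant images when a node eviction signal fires or when the filesystem holding the images is short on space. The key points of the GC are: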

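  • Cleanup starts when usage reaches HighThresholdPercent and continues until usage is down to LowThresholdPercent. Note, however, that disabling GC by setting HighThresholdPercent to 100 has no effect on node eviction; it only turns off the periodic cleanup task.
  • During cleanup, three kinds of images are not removed:
    • the image required by the sandbox;
    • images first detected during this GC pass, and images whose last-used time has just been refreshed;
    • images whose time since first detection is less than MinimumGCAge.
  • Cleanup removes the least recently used images first, and among those the earliest detected.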

Original blog post: study4.fun/2021-04-25-…