Build version: v1.14.3
git repo: github.com/NVIDIA/nvid…
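1. Overview
The Makefile in the repository root shows that the image is built in several steps, which fall into two main stages:
- Use docker build to package the nvidia-container-toolkit executables into several RPM packages (Makefile)
- Install those RPM packages during a second docker build to produce the final container-toolkit image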
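2. RPM packaging process
2.1 Makefile execution walkthrough
make docker-native -n selects the arm or x86 build automatically, based on the host architecture
make docker-x86_64 -n shows the build steps for the x86 packages
make docker-aarch64 -n shows the build steps for the arm packages, as does make docker-arm64 -n (arm64 and aarch64 both refer to arm; I am building for centos7, so I use docker-aarch64; for ubuntu, use docker-arm64)
See also: the Makefile includes docker/docker.mk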
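The choice between the two ARM targets can be sketched as a small helper. This is only an illustration of the rule stated above; the function name pick_arm_target and the rpm/deb labels are made up here and do not exist in the repository.

```shell
#!/bin/sh
# Hypothetical helper: pick the make target for an ARM build based on the
# package family. RPM distros such as centos7 use docker-aarch64, DEB
# distros such as ubuntu use docker-arm64.
pick_arm_target() {
    case "$1" in
        rpm) echo "docker-aarch64" ;;
        deb) echo "docker-arm64" ;;
        *)   echo "unknown package family: $1" >&2; return 1 ;;
    esac
}

# Print the dry-run command instead of running it.
echo "make $(pick_arm_target rpm) -n"
```

The -n flag keeps this a dry run, so nothing is built until the flag is removed.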
make docker-aarch64 -n prints:
echo "Building for centos7-aarch64"
docker pull --platform=linux/aarch64 centos:7
DOCKER_BUILDKIT=1 \
docker build \
--platform=linux/aarch64 \
--progress=plain \
--build-arg BASEIMAGE="centos:7" \
--build-arg GOLANG_VERSION="1.20.5" \
--build-arg PKG_NAME="nvidia-container-toolkit" \
--build-arg PKG_VERS="1.14.3" \
--build-arg PKG_REV="1" \
--build-arg LIBNVIDIA_CONTAINER_TOOLS_VERSION="1.14.3-1" \
--build-arg GIT_COMMIT="53b24618a542025b108239fe602e66e912b7d6e2" \
--tag nvidia/nvidia-container-toolkit/centos7-aarch64 \
--file /home/ccadmin/fengsp/nvidia-container-toolkit/docker/Dockerfile.rpm-yum .
docker run \
--platform=linux/aarch64 \
-e DISTRIB \
-e SECTION \
-v /home/ccadmin/fengsp/nvidia-container-toolkit/dist/centos7/aarch64:/dist \
nvidia/nvidia-container-toolkit/centos7-aarch64
echo "Building for centos8-aarch64"
docker pull --platform=linux/aarch64 quay.io/centos/centos:stream8
DOCKER_BUILDKIT=1 \
docker build \
--platform=linux/aarch64 \
--progress=plain \
--build-arg BASEIMAGE="quay.io/centos/centos:stream8" \
--build-arg GOLANG_VERSION="1.20.5" \
--build-arg PKG_NAME="nvidia-container-toolkit" \
--build-arg PKG_VERS="1.14.3" \
--build-arg PKG_REV="1" \
--build-arg LIBNVIDIA_CONTAINER_TOOLS_VERSION="1.14.3-1" \
--build-arg GIT_COMMIT="53b24618a542025b108239fe602e66e912b7d6e2" \
--tag nvidia/nvidia-container-toolkit/centos8-aarch64 \
--file /home/ccadmin/fengsp/nvidia-container-toolkit/docker/Dockerfile.rpm-yum .
docker run \
--platform=linux/aarch64 \
-e DISTRIB \
-e SECTION \
-v /home/ccadmin/fengsp/nvidia-container-toolkit/dist/centos8/aarch64:/dist \
nvidia/nvidia-container-toolkit/centos8-aarch64
echo "Building for rhel8-aarch64"
docker pull --platform=linux/aarch64 quay.io/centos/centos:stream8
DOCKER_BUILDKIT=1 \
docker build \
--platform=linux/aarch64 \
--progress=plain \
--build-arg BASEIMAGE="quay.io/centos/centos:stream8" \
--build-arg GOLANG_VERSION="1.20.5" \
--build-arg PKG_NAME="nvidia-container-toolkit" \
--build-arg PKG_VERS="1.14.3" \
--build-arg PKG_REV="1" \
--build-arg LIBNVIDIA_CONTAINER_TOOLS_VERSION="1.14.3-1" \
--build-arg GIT_COMMIT="53b24618a542025b108239fe602e66e912b7d6e2" \
--tag nvidia/nvidia-container-toolkit/centos8-aarch64 \
--file /home/ccadmin/fengsp/nvidia-container-toolkit/docker/Dockerfile.rpm-yum .
docker run \
--platform=linux/aarch64 \
-e DISTRIB \
-e SECTION \
-v /home/ccadmin/fengsp/nvidia-container-toolkit/dist/rhel8/aarch64:/dist \
nvidia/nvidia-container-toolkit/centos8-aarch64
echo "Building for amazonlinux2-aarch64"
docker pull --platform=linux/aarch64 amazonlinux:2
DOCKER_BUILDKIT=1 \
docker build \
--platform=linux/aarch64 \
--progress=plain \
--build-arg BASEIMAGE="amazonlinux:2" \
--build-arg GOLANG_VERSION="1.20.5" \
--build-arg PKG_NAME="nvidia-container-toolkit" \
--build-arg PKG_VERS="1.14.3" \
--build-arg PKG_REV="1" \
--build-arg LIBNVIDIA_CONTAINER_TOOLS_VERSION="1.14.3-1" \
--build-arg GIT_COMMIT="53b24618a542025b108239fe602e66e912b7d6e2" \
--tag nvidia/nvidia-container-toolkit/amazonlinux2-aarch64 \
--file /home/ccadmin/fengsp/nvidia-container-toolkit/docker/Dockerfile.rpm-yum .
docker run \
--platform=linux/aarch64 \
-e DISTRIB \
-e SECTION \
-v /home/ccadmin/fengsp/nvidia-container-toolkit/dist/amazonlinux2/aarch64:/dist \
nvidia/nvidia-container-toolkit/amazonlinux2-aarch64
rm docker-build-amazonlinux2-aarch64 docker-build-centos7-aarch64 docker-build-rhel8-aarch64 docker-build-centos8-aarch64
2.2 RPM generation for centos7
We currently use only centos7, so only the centos7 part matters here
echo "Building for centos7-aarch64"
# Pull the base image
docker pull --platform=linux/aarch64 centos:7
# Build the nvidia/nvidia-container-toolkit/centos7-aarch64 image
DOCKER_BUILDKIT=1 \
docker build \
--platform=linux/aarch64 \
--progress=plain \
--build-arg BASEIMAGE="centos:7" \
--build-arg GOLANG_VERSION="1.20.5" \
--build-arg PKG_NAME="nvidia-container-toolkit" \
--build-arg PKG_VERS="1.14.3" \
--build-arg PKG_REV="1" \
--build-arg LIBNVIDIA_CONTAINER_TOOLS_VERSION="1.14.3-1" \
--build-arg GIT_COMMIT="53b24618a542025b108239fe602e66e912b7d6e2" \
--tag nvidia/nvidia-container-toolkit/centos7-aarch64 \
--file /home/ccadmin/fengsp/nvidia-container-toolkit/docker/Dockerfile.rpm-yum .
# Run the nvidia/nvidia-container-toolkit/centos7-aarch64 image; the generated RPMs are stored in the dist directory
docker run \
--platform=linux/aarch64 \
-e DISTRIB \
-e SECTION \
-v /home/ccadmin/fengsp/nvidia-container-toolkit/dist/centos7/aarch64:/dist \
nvidia/nvidia-container-toolkit/centos7-aarch64
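There are three main steps:
- Pull the centos7 base image
- Build an image from the Dockerfile nvidia-container-toolkit/docker/Dockerfile.rpm-yum (DOCKER_BUILDKIT=1 is set here, which requires Docker CE 18.09 or later)
- Run the image produced in step two to build the RPM packages; they are written to the locally mounted dist directory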
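The three steps can be wrapped in a small script for a single distro/arch pair. This is a sketch, not the repository's own tooling: DRY_RUN, run, and build_rpms are illustrative names, and most of the --build-arg flags from the make output are left out for brevity, so a real build would need them added back.

```shell
#!/bin/sh
# Sketch of the three steps for one distro/arch pair.
# DRY_RUN=1 (the default) only prints the commands.
DRY_RUN=${DRY_RUN:-1}

run() {
    # Print the command in dry-run mode, execute it otherwise.
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

build_rpms() {
    base_image=$1   # e.g. centos:7
    dist=$2         # e.g. centos7
    arch=$3         # e.g. aarch64
    image="nvidia/nvidia-container-toolkit/${dist}-${arch}"

    # Step 1: pull the base image.
    run docker pull "--platform=linux/${arch}" "$base_image"
    # Step 2: build the packaging image (remaining --build-arg flags omitted).
    run env DOCKER_BUILDKIT=1 docker build "--platform=linux/${arch}" \
        --build-arg "BASEIMAGE=${base_image}" \
        --tag "$image" --file docker/Dockerfile.rpm-yum .
    # Step 3: run the image; the RPMs land in the mounted dist directory.
    run docker run "--platform=linux/${arch}" -e DISTRIB -e SECTION \
        -v "$(pwd)/dist/${dist}/${arch}:/dist" "$image"
}

build_rpms centos:7 centos7 aarch64
```

In dry-run mode the script prints one line per step, which makes it easy to compare against the output of make docker-aarch64 -n.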
Analysis of Dockerfile.rpm-yum:
ARG BASEIMAGE
FROM ${BASEIMAGE}
# Note 1. For centos8, the Aliyun mirror can be substituted
#RUN wget -O /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-8.repo
#RUN rm -rf /etc/yum.repos.d/
#COPY Centos-8.repo /etc/yum.repos.d/CentOS-Base.repo
# Note 2. Install the basic build tools
RUN yum install -y \
ca-certificates \
gcc \
wget \
git \
make \
rpm-build && \
rm -rf /var/cache/yum/*
# Note 3. Install golang
ARG GOLANG_VERSION=0.0.0
RUN set -eux; \
\
arch="$(uname -m)"; \
case "${arch##*-}" in \
x86_64 | amd64) ARCH='amd64' ;; \
ppc64el | ppc64le) ARCH='ppc64le' ;; \
aarch64) ARCH='arm64' ;; \
*) echo "unsupported architecture"; exit 1 ;; \
esac; \
wget -nv -O - https://storage.googleapis.com/golang/go${GOLANG_VERSION}.linux-${ARCH}.tar.gz \
| tar -C /usr/local -xz
ENV GOPATH /go
ENV PATH $GOPATH/bin:/usr/local/go/bin:$PATH
# packaging
ARG PKG_NAME
ARG PKG_VERS
ARG PKG_REV
ENV PKG_NAME ${PKG_NAME}
ENV PKG_VERS ${PKG_VERS}
ENV PKG_REV ${PKG_REV}
# output directory
ENV DIST_DIR=/tmp/nvidia-container-toolkit-$PKG_VERS/SOURCES
RUN mkdir -p $DIST_DIR /dist
# nvidia-container-toolkit
WORKDIR $GOPATH/src/nvidia-container-toolkit
COPY . .
ARG GIT_COMMIT
ENV GIT_COMMIT ${GIT_COMMIT}
# Note 4. Run make cmds to compile each submodule under the cmd directory into an executable; make cmds -n shows the build steps
RUN make PREFIX=${DIST_DIR} cmds
# Note 5. Copy the RPM packaging spec files from the repo into the image; the key file is packaging/rpm/SPECS/nvidia-container-toolkit.spec
WORKDIR $DIST_DIR/..
COPY packaging/rpm .
ARG LIBNVIDIA_CONTAINER_TOOLS_VERSION
ENV LIBNVIDIA_CONTAINER_TOOLS_VERSION ${LIBNVIDIA_CONTAINER_TOOLS_VERSION}
# Note 6. Set the image's default command, so that running the container performs the rpmbuild packaging step by default
CMD arch=$(uname -m) && \
rpmbuild --clean --target=$arch -bb \
-D "_topdir $PWD" \
-D "release_date $(date +'%a %b %d %Y')" \
-D "git_commit ${GIT_COMMIT}" \
-D "version ${PKG_VERS}" \
-D "libnvidia_container_tools_version ${LIBNVIDIA_CONTAINER_TOOLS_VERSION}" \
-D "release ${PKG_REV}" \
SPECS/nvidia-container-toolkit.spec && \
mv RPMS/$arch/*.rpm /dist
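The architecture mapping in Note 3 can be tried outside the image. The sketch below mirrors the Dockerfile's case statement; the function names go_arch and go_tarball_url are made up for illustration. The expansion ${machine##*-} strips everything up to the last hyphen, so for plain uname -m output such as aarch64 it leaves the value unchanged.

```shell
#!/bin/sh
# Map a machine name (as printed by `uname -m`) to the Go release
# architecture, mirroring the case statement in Dockerfile.rpm-yum.
go_arch() {
    machine=$1
    case "${machine##*-}" in
        x86_64 | amd64)    echo amd64 ;;
        ppc64el | ppc64le) echo ppc64le ;;
        aarch64)           echo arm64 ;;
        *) echo "unsupported architecture" >&2; return 1 ;;
    esac
}

# Build the download URL used by the wget step.
go_tarball_url() {
    version=$1
    arch=$(go_arch "$2") || return 1
    echo "https://storage.googleapis.com/golang/go${version}.linux-${arch}.tar.gz"
}

go_tarball_url 1.20.5 aarch64
```

Printing the URL first is a cheap way to confirm that GOLANG_VERSION and the architecture resolve to a real tarball before starting a long build.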
3. Building the container-toolkit image from the RPM packages
For this stage, copy build/container/Makefile to the repository root, overwriting the Makefile used for the stage-one RPM build
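Because the copy overwrites the root Makefile, it may be worth keeping a backup first. The helper below is a hypothetical sketch: the name use_container_makefile and the .rpm.bak suffix are arbitrary choices, not part of the repository.

```shell
#!/bin/sh
# Hypothetical helper: back up the root Makefile, then replace it with
# build/container/Makefile.
use_container_makefile() {
    repo=$1
    [ -f "$repo/build/container/Makefile" ] || {
        echo "missing $repo/build/container/Makefile" >&2
        return 1
    }
    # Back up only once, so a second call does not overwrite the original.
    [ -f "$repo/Makefile.rpm.bak" ] || cp "$repo/Makefile" "$repo/Makefile.rpm.bak" || return 1
    cp "$repo/build/container/Makefile" "$repo/Makefile"
}
```

Calling use_container_makefile . from the repository root performs the swap; cp Makefile.rpm.bak Makefile restores the RPM-stage Makefile afterwards.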
Since we want to build the version corresponding to nvcr.io/nvidia/k8s/container-toolkit:v1.14.3-ubi8, run
make build-ubi8 -n to view the corresponding image build steps:
DOCKER_BUILDKIT=1 \
docker build --pull \
\
--platform=linux/amd64 \
--tag nvidia/container-toolkit:1.14.3-ubi8 \
--build-arg ARTIFACTS_ROOT="dist" \
--build-arg BASE_DIST="ubi8" \
--build-arg CUDA_VERSION="12.2.2" \
--build-arg GOLANG_VERSION="1.20.5" \
--build-arg LIBNVIDIA_CONTAINER0_VERSION="" \
--build-arg PACKAGE_DIST="centos7" \
--build-arg PACKAGE_VERSION="1.14.3" \
--build-arg VERSION="1.14.3" \
--build-arg GIT_COMMIT="53b24618a542025b108239fe602e66e912b7d6e2-dirty" \
--build-arg GIT_COMMIT_SHORT="53b2461" \
--build-arg GIT_BRANCH="HEAD" \
--build-arg SOURCE_DATE_EPOCH="1697714542" \
-f /home/ccadmin/fengsp/nvidia-container-toolkit/build/container/Dockerfile.centos \
/home/ccadmin/fengsp/nvidia-container-toolkit
As shown here, the core Dockerfile is:
nvidia-container-toolkit/build/container/Dockerfile.centos
# nvidia-container-toolkit/build/container/Dockerfile.centos
ARG BASE_DIST
ARG CUDA_VERSION
ARG GOLANG_VERSION=x.x.x
ARG VERSION="N/A"
# Note 1: for ubi8, the base image is nvidia/cuda:12.2.2-base-ubi8
FROM nvidia/cuda:${CUDA_VERSION}-base-${BASE_DIST} as build
#FROM nvcr.io/nvidia/cuda:${CUDA_VERSION}-base-${BASE_DIST} as build
# Note 2: install the basic dependencies and set up the golang build environment
RUN yum install -y \
wget make git gcc \
&& \
rm -rf /var/cache/yum/*
ARG GOLANG_VERSION=x.x.x
RUN set -eux; \
\
arch="$(uname -m)"; \
case "${arch##*-}" in \
x86_64 | amd64) ARCH='amd64' ;; \
ppc64el | ppc64le) ARCH='ppc64le' ;; \
aarch64) ARCH='arm64' ;; \
*) echo "unsupported architecture" ; exit 1 ;; \
esac; \
wget -nv -O - https://storage.googleapis.com/golang/go${GOLANG_VERSION}.linux-${ARCH}.tar.gz \
| tar -C /usr/local -xz
ENV GOPATH /go
ENV PATH $GOPATH/bin:/usr/local/go/bin:$PATH
WORKDIR /build
COPY . .
# NOTE: Until the config utilities are properly integrated into the
# nvidia-container-toolkit repository, these are built from the `tools` folder
# and not `cmd`.
# Note 3: build the tools under the tools directory into executables, including containerd, crio, docker, nvidia-toolkit, etc.
RUN GOPATH=/artifacts go install -ldflags="-s -w -X 'main.Version=${VERSION}'" ./tools/...
# Note: for ubi8, the base image is nvidia/cuda:12.2.2-base-ubi8
FROM nvidia/cuda:${CUDA_VERSION}-base-${BASE_DIST}
#FROM nvcr.io/nvidia/cuda:${CUDA_VERSION}-base-${BASE_DIST}
ARG BASE_DIST
# See https://www.centos.org/centos-linux-eol/
# and https://stackoverflow.com/a/70930049 for move to vault.centos.org
# and https://serverfault.com/questions/1093922/failing-to-run-yum-update-in-centos-8 for move to vault.epel.cloud
RUN [[ "${BASE_DIST}" != "centos8" ]] || \
( \
sed -i 's/mirrorlist/#mirrorlist/g' /etc/yum.repos.d/CentOS-Linux-* && \
sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.epel.cloud|g' /etc/yum.repos.d/CentOS-Linux-* \
)
ENV NVIDIA_DISABLE_REQUIRE="true"
ENV NVIDIA_VISIBLE_DEVICES=void
ENV NVIDIA_DRIVER_CAPABILITIES=utility
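# Note 4: copy the nvidia-container-toolkit RPMs generated earlier from the local dist directory into /artifacts/packages/ in the container
# Besides the nvidia-container-toolkit RPMs, the libnvidia-container dependency packages are also needed (https://github.com/NVIDIA/libnvidia-container)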
ARG ARTIFACTS_ROOT
ARG PACKAGE_DIST
COPY ${ARTIFACTS_ROOT}/${PACKAGE_DIST} /artifacts/packages/${PACKAGE_DIST}
WORKDIR /artifacts/packages
# Note 5: run yum localinstall here to install all the relevant packages from the local files
ARG PACKAGE_VERSION
ARG TARGETARCH
ENV PACKAGE_ARCH ${TARGETARCH}
RUN PACKAGE_ARCH=${PACKAGE_ARCH/amd64/x86_64} && PACKAGE_ARCH=${PACKAGE_ARCH/arm64/aarch64} && \
yum localinstall -y \
#${PACKAGE_DIST}/${PACKAGE_ARCH}/libnvidia-container1-1.*.rpm \
#${PACKAGE_DIST}/${PACKAGE_ARCH}/libnvidia-container-tools-1.*.rpm \
${PACKAGE_DIST}/${PACKAGE_ARCH}/nvidia-container-toolkit*-${PACKAGE_VERSION}*.rpm
WORKDIR /work
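# Note 6: after the RPMs are installed, copy the executables built in the build stage (/artifacts/bin) into /work in the target image and add /work to PATH, which completes the target image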
COPY --from=build /artifacts/bin /work
ENV PATH=/work:$PATH
LABEL io.k8s.display-name="NVIDIA Container Runtime Config"
LABEL name="NVIDIA Container Runtime Config"
LABEL vendor="NVIDIA"
LABEL version="${VERSION}"
LABEL release="N/A"
LABEL summary="Automatically Configure your Container Runtime for GPU support."
LABEL description="See summary"
RUN mkdir /licenses && mv /NGC-DL-CONTAINER-LICENSE /licenses/NGC-DL-CONTAINER-LICENSE
ENTRYPOINT ["/work/nvidia-toolkit"]
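The path that yum localinstall searches in Note 5 depends on TARGETARCH, which BuildKit derives from --platform. The sketch below reproduces that lookup so it can be checked before a build. The function names rpm_arch and toolkit_rpm_glob are made up for illustration, and the case statement is a POSIX stand-in for the bash substitutions ${PACKAGE_ARCH/amd64/x86_64} and ${PACKAGE_ARCH/arm64/aarch64}, equivalent for the values amd64 and arm64.

```shell
#!/bin/sh
# Translate a Docker TARGETARCH value into the RPM architecture name.
rpm_arch() {
    case "$1" in
        amd64) echo x86_64 ;;
        arm64) echo aarch64 ;;
        *)     echo "$1" ;;
    esac
}

# Print the glob that yum localinstall would expand, relative to
# /artifacts/packages.
toolkit_rpm_glob() {
    package_dist=$1
    target_arch=$2
    package_version=$3
    echo "${package_dist}/$(rpm_arch "$target_arch")/nvidia-container-toolkit*-${package_version}*.rpm"
}

toolkit_rpm_glob centos7 amd64 1.14.3
toolkit_rpm_glob centos7 arm64 1.14.3
```

With --platform=linux/amd64, as in the dry-run output above, the lookup directory is centos7/x86_64, so RPMs that exist only under dist/centos7/aarch64 would presumably not match unless the platform is switched to linux/arm64.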
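4. Redirecting toolkit container logs
The toolkit hook executables that actually run live in /usr/local/nvidia/toolkit of the nvidia-container-toolkit pod; the details of how the files are copied and rewritten are in the installToolkit() function
Adding log redirection: during installation the real nvidia-container-runtime-hook executable is copied to nvidia-container-runtime-hook.real, and /usr/local/nvidia/toolkit/nvidia-container-runtime-hook is just a shell-script wrapper. I added a few logging lines to it, so that every time the runc hook runs, its output is redirected to /tmp/ericlog, a temporary file on the host. Any logging added to the hook while building and debugging can then be read there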
cat /usr/local/nvidia/toolkit/nvidia-container-runtime-hook
#! /bin/sh
date >>/tmp/ericlog # added: logs the time
pwd >>/tmp/ericlog # added: logs the container directory, e.g. /run/containers/storage/overlay-containers/579f375272c02c4a746f720c3bade063778e79f6078b2be4ee084cd080c327a2/userdata
echo "$@" >>/tmp/ericlog # added: logs the arguments passed to the hook, e.g. prestart
PATH=/usr/local/nvidia/toolkit:$PATH \
/usr/local/nvidia/toolkit/nvidia-container-runtime-hook.real \
-config "/usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml" \
"$@" >> /tmp/ericlog # added: redirects the standard output of the nvidia-container-runtime-hook code to the log file
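The three logging lines in the wrapper can also be grouped into one helper. This is a hypothetical variation, not what the toolkit installs: log_hook_call and HOOK_LOG are made-up names, with HOOK_LOG defaulting to the /tmp/ericlog path used above.

```shell
#!/bin/sh
# Hypothetical logging helper for a hook wrapper.
HOOK_LOG=${HOOK_LOG:-/tmp/ericlog}

log_hook_call() {
    # Append the time, working directory, and arguments in one write.
    {
        echo "time: $(date)"
        echo "cwd:  $(pwd)"
        echo "args: $*"
    } >>"$HOOK_LOG"
}

# In the wrapper, call it before running the real binary. Adding 2>&1 to
# the redirect would capture stderr as well:
#   log_hook_call "$@"
#   nvidia-container-runtime-hook.real -config ... "$@" >>"$HOOK_LOG" 2>&1
```

While debugging, tail -f /tmp/ericlog on the host shows each hook invocation as it happens.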