Kubeflow is a project that makes it easy to deploy machine learning workflows on Kubernetes, bundling many tools used for ML. Normally there is nothing worth writing about installing a tool: you follow the docs and you are done. But Kubeflow has many components, runs on top of Kubernetes, and the official installer does not account for the network restrictions in mainland China, which makes installation somewhat tricky. This article therefore offers a stable installation recipe for users in China. It involves a fair amount of manual work, but that gives you a more detailed understanding of the process, so the same approach transfers to similar network problems elsewhere (for example, installing Kubernetes itself).
The network problem here is that the mainland Chinese network cannot directly reach Google's official image registry. Yet many cloud-native projects, such as Kubernetes and Kubeflow, are led by Google and pull from that registry at deployment time, which makes deployment difficult for users in China.
Kubeflow officially recommends two categories of deployment methods:
This article deploys with manifests. The official deployment method has changed three times: it started as plain YAML files, then moved to the ksonnet-based kfctl.sh script, and is currently the kustomize-based kfctl binary installer.
The kfctl command has the following subcommands:
- build: generates the configuration defining each Kubeflow component, as YAML files.
- apply: runs build first, then deploys Kubeflow onto the Kubernetes cluster from the generated configuration.
- delete: removes a deployed Kubeflow; add --force-deletion to force removal.
- alpha: kfctl alpha help lists commands still in alpha status, such as set-image-name, which rewrites image addresses for users who cannot reach gcr (I have not used it: it is still alpha, and I have not tested the stability or freshness of domestic mirror addresses).
Prerequisites
- A working Kubernetes cluster: a managed offering (Tencent Cloud, Alibaba Cloud, etc.) or a self-hosted cluster.
- Mind the version compatibility between Kubernetes and Kubeflow: github.com/kubeflow/we…
Choosing software versions
Kubeflow is still a fairly young project, so do not be too aggressive when picking versions; a version already tested on AWS is a safe choice.
- kubernetes: 1.18
- kubeflow: 1.2.0
What happens if you ignore the network problem and install Kubeflow directly?
I already have a working Kubernetes cluster with one node:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
10.0.5.43 Ready <none> 22h v1.20.6-tke.9
Download the latest Kubeflow installer, kfctl, from GitHub:
# GitHub release page: https://github.com/kubeflow/kfctl/releases
$ wget https://github.com/kubeflow/kfctl/releases/download/v1.2.0/kfctl_v1.2.0-0-gbc038f9_darwin.tar.gz
$ tar -xf kfctl_v1.2.0-0-gbc038f9_darwin.tar.gz
$ mv ./kfctl /usr/local/bin
$ kfctl version
kfctl v1.2.0-0-gbc038f9
Download the manifests tarball for the version you need (v1.2.0 here, the version my company uses):
$ wget https://github.com/kubeflow/manifests/archive/refs/tags/v1.2.0.tar.gz
$ tar -xf v1.2.0.tar.gz
List the YAML files used to install Kubeflow (there are many, for different installation scenarios; we use kfctl_istio_dex.v1.2.0.yaml, and each file can be edited as needed):
$ ls manifests-1.2.0/kfdef/
OWNERS kfctl_aws.yaml kfctl_ibm.v1.0.1.yaml kfctl_istio_dex.v1.2.0.yaml
README.md kfctl_aws_cognito.v1.0.0.yaml kfctl_ibm.v1.0.2.yaml kfctl_istio_dex.yaml
generic kfctl_aws_cognito.v1.0.1.yaml kfctl_ibm.v1.1.0.yaml kfctl_k8s_istio.v1.0.0.yaml
kfctl_anthos.v1.0.0.yaml kfctl_aws_cognito.v1.0.2.yaml kfctl_ibm.v1.2.0.yaml kfctl_k8s_istio.v1.0.1.yaml
kfctl_anthos.v1.0.1.yaml kfctl_aws_cognito.v1.1.0.yaml kfctl_ibm.yaml kfctl_k8s_istio.v1.0.2.yaml
kfctl_anthos.v1.0.2.yaml kfctl_aws_cognito.v1.2.0.yaml kfctl_ibm_dex_multi_user.v1.1.0.yaml kfctl_k8s_istio.v1.1.0.yaml
kfctl_anthos.yaml kfctl_aws_cognito.yaml kfctl_ibm_multi_user.v1.2.0.yaml kfctl_k8s_istio.v1.2.0.yaml
kfctl_aws.v1.0.0.yaml kfctl_azure.v1.1.0.yaml kfctl_ibm_multi_user.yaml kfctl_k8s_istio.yaml
kfctl_aws.v1.0.1.yaml kfctl_azure.v1.2.0.yaml kfctl_istio_dex.v1.0.0.yaml kfctl_openshift.v1.1.0.yaml
kfctl_aws.v1.0.2.yaml kfctl_azure_aad.v1.1.0.yaml kfctl_istio_dex.v1.0.1.yaml kfctl_openshift.v1.2.0.yaml
kfctl_aws.v1.1.0.yaml kfctl_azure_aad.v1.2.0.yaml kfctl_istio_dex.v1.0.2.yaml source
kfctl_aws.v1.2.0.yaml kfctl_ibm.v1.0.0.yaml kfctl_istio_dex.v1.1.0.yaml
Set the environment variables kfctl needs:
# Deployment name
$ export KF_NAME="a4x-kubeflow"
# Path to the directory containing the manifests
$ export KF_DIR="/Users/alex/Workspace/study/ml/kubeflow"
# Full path to the config file
$ export CONFIG_FILE="/Users/alex/Workspace/study/ml/kubeflow/manifests-1.2.0/kfdef/kfctl_istio_dex.v1.2.0.yaml"
Create a StorageClass on the Kubernetes cluster and set it as the default storage class:
When applied, the Kubeflow configuration automatically creates several volumes for data storage, but it does not name a StorageClass, so we must create one and mark it as the default.
$ kubectl get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
kubeflow udisk.csi.ucloud.cn Delete WaitForFirstConsumer true 38m
$ kubectl patch storageclass kubeflow -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
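For reference, this is roughly what a StorageClass already marked as default looks like in declarative form, an alternative to the patch command above. This is a minimal sketch: the provisioner and other values are taken from the `kubectl get sc` output above (UCloud's udisk CSI driver); substitute your own cloud's provisioner and parameters.

```yaml
# Minimal sketch of a default StorageClass; substitute your own provisioner.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: kubeflow
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: udisk.csi.ucloud.cn
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```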
For detailed documentation, see: kubernetes.io/zh/docs/tas…
Enable trustworthy JWTs in the apiserver:
Kubeflow depends on Istio, and Istio 1.3.1 and later requires this configuration. Append the following lines to the apiserver configuration file:
--service-account-signing-key-file=/etc/kubernetes/ssl/ca-key.pem
--service-account-issuer=kubernetes.default.svc
--api-audiences=kubernetes.default.svc
So my apiserver configuration file reads:
# /etc/kubernetes/apiserver
......
--proxy-client-key-file=/etc/kubernetes/ssl/aggregator-key.pem \
--service-cluster-ip-range=172.17.0.0/16 \
--service-node-port-range=30000-32767 \
--service-account-signing-key-file=/etc/kubernetes/ssl/ca-key.pem \
--service-account-issuer=kubernetes.default.svc \
--api-audiences=kubernetes.default.svc
For details, see: imroc.cc/istio/troub…
Apply the configuration to the Kubernetes cluster to create the application:
$ kfctl apply -V -f $CONFIG_FILE
Now check the state of the relevant Pods in the cluster:
$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
cert-manager cert-manager-59b485c4cc-r9hdh 0/1 ImagePullBackOff 0 43m
cert-manager cert-manager-cainjector-5bb487bcd-h5gds 0/1 ImagePullBackOff 0 43m
cert-manager cert-manager-webhook-74b4bd9bcc-7gwpn 0/1 ContainerCreating 0 43m
istio-system cluster-local-gateway-84bb595449-swdv5 0/1 ImagePullBackOff 0 43m
istio-system istio-citadel-7f66ddfcfb-9ftnz 0/1 ImagePullBackOff 0 43m
istio-system istio-galley-7976dd55cd-c7xqd 0/1 ContainerCreating 0 43m
istio-system istio-ingressgateway-c79f9f6f-qkvqv 0/1 ContainerCreating 0 43m
istio-system istio-nodeagent-6z27f 0/1 ImagePullBackOff 0 43m
istio-system istio-nodeagent-9f2v8 0/1 ImagePullBackOff 0 43m
istio-system istio-nodeagent-g8c4k 0/1 ImagePullBackOff 0 43m
istio-system istio-nodeagent-hnvmm 0/1 ImagePullBackOff 0 43m
istio-system istio-nodeagent-rcrxm 0/1 ImagePullBackOff 0 43m
istio-system istio-nodeagent-ssj4f 0/1 ImagePullBackOff 0 43m
istio-system istio-pilot-7bd96d69d9-jbwb2 0/2 ContainerCreating 0 43m
istio-system istio-policy-66b5d9887c-k4rxl 0/2 ContainerCreating 0 43m
istio-system istio-security-post-install-release-1.3-latest-daily-gzqdj 0/1 ImagePullBackOff 0 43m
istio-system istio-sidecar-injector-56b6997f7d-9ql22 0/1 ContainerCreating 0 43m
istio-system istio-telemetry-856f7bcff4-2gk7v 0/2 ContainerCreating 0 43m
istio-system prometheus-65fdcbc857-xgxtx 0/1 ContainerCreating 0 43m
kube-system cloudprovider-ucloud-7d959cc87d-r9q4s 1/1 Running 0 4h34m
kube-system coredns-68599f8c7f-9stf9 1/1 Running 0 4h34m
kube-system coredns-68599f8c7f-q5m4p 1/1 Running 0 4h34m
kube-system csi-udisk-4ssjk 2/2 Running 0 4h33m
kube-system csi-udisk-5vhtn 2/2 Running 0 4h33m
kube-system csi-udisk-controller-0 5/5 Running 0 4h34m
kube-system csi-udisk-fktzc 2/2 Running 0 4h33m
kube-system csi-udisk-xd7z8 2/2 Running 0 4h33m
kube-system csi-udisk-xn7wb 2/2 Running 0 4h33m
kube-system csi-udisk-zpmx4 2/2 Running 0 4h33m
kube-system csi-ufile-c4fsj 2/2 Running 0 4h33m
kube-system csi-ufile-controller-0 4/4 Running 0 4h34m
kube-system csi-ufile-ctvfn 2/2 Running 0 4h33m
kube-system csi-ufile-nc47n 2/2 Running 0 4h33m
kube-system csi-ufile-td9fs 2/2 Running 0 4h33m
kube-system csi-ufile-wj9kl 2/2 Running 0 4h33m
kube-system csi-ufile-zxlsp 2/2 Running 0 4h33m
kube-system metrics-server-749544fd7b-52hr2 1/1 Running 0 4h34m
kube-system nvidia-device-plugin-daemonset-7hzm9 1/1 Running 0 4h33m
kube-system nvidia-device-plugin-daemonset-sjdmj 1/1 Running 0 4h33m
kube-system uk8s-kubectl-7585dc44f7-5ch9l 1/1 Running 0 4h34m
kubeflow application-controller-stateful-set-0 0/1 ImagePullBackOff 0 43m
Many Pods are stuck in ImagePullBackOff, which means the required images cannot be pulled directly from gcr.
Working around the network problem to start all Kubeflow services
Since images cannot be pulled from gcr and the services cannot start, the fix is: find the images each failing Pod needs, download them on a machine that can reach gcr, push them to a private registry, and point the relevant YAML files at the new image addresses.
Determine the list of images that must be downloaded manually:
# List all Pods
$ kubectl get pods -A
# Show which images a given Pod uses
$ kubectl describe pod <podName> -n <namespace>
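Instead of describing Pods one by one, the failing images can also be collected in bulk. The helper below is a sketch: it greps image references pointing at blocked registries out of whatever files you feed it (for example, the extracted manifests, or saved `kubectl describe` output); the registry list is an assumption based on the registries seen above.

```shell
# Sketch: collect the unique image references that point at registries
# blocked in mainland China (gcr.io and quay.io are the ones seen above).
list_blocked_images() {
  grep -hoE '(gcr|quay)\.io/[A-Za-z0-9._/-]+(:[A-Za-z0-9._-]+)?' "$@" | sort -u
}

# Usage against the extracted manifests from earlier in this walkthrough:
#   list_blocked_images $(grep -rl 'image:' manifests-1.2.0/kfdef/)
```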
Having found the images the failing Pods use, pull them locally, retag them, and push them to your private registry:
# A single image as an example
$ docker pull quay.io/jetstack/cert-manager-controller:v0.11.0
$ docker tag quay.io/jetstack/cert-manager-controller:v0.11.0 hub.service.xxcloud.cn/xxx-kubeflow/jetstack/cert-manager-controller:v0.11.0
$ docker push hub.service.xxcloud.cn/xxx-kubeflow/jetstack/cert-manager-controller:v0.11.0
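With dozens of images, the pull/tag/push cycle is worth scripting. This is a sketch under two assumptions: `hub.service.xxcloud.cn/xxx-kubeflow` stands in for your own private registry (as in the example above), and the target path is formed by simply swapping out the original registry host. It only prints the docker commands so you can review them before running:

```shell
# Sketch: batch version of the pull/tag/push above. It only PRINTS the
# docker commands for review; pipe the output through `sh` to execute.
PRIVATE_REG="hub.service.xxcloud.cn/xxx-kubeflow"   # assumed private registry

mirror_image() {
  src="$1"
  dst="${PRIVATE_REG}/${src#*/}"   # swap out the original registry host
  printf 'docker pull %s\n' "$src"
  printf 'docker tag %s %s\n' "$src" "$dst"
  printf 'docker push %s\n' "$dst"
}

mirror_image quay.io/jetstack/cert-manager-controller:v0.11.0
# mirror_image <image> | sh   # execute once the commands look right
```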
Use VS Code to replace across the whole manifests tree (command+shift+h for global replace), changing gcr.io and quay.io to your own private registry address.
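If you prefer the command line to VS Code, the same global replacement can be sketched with grep and sed. The private registry address is the assumed one from above, and since `sed -i` edits files in place, back up (or commit) `manifests-1.2.0/` first.

```shell
# Sketch: CLI equivalent of the VS Code global replace -- rewrite the
# gcr.io / quay.io host to the private registry (same mapping as the
# manual retag example above).
PRIVATE_REG="hub.service.xxcloud.cn/xxx-kubeflow"   # assumed private registry

rewrite_registry() {
  sed -E "s#(gcr|quay)\.io/#${PRIVATE_REG}/#g"
}

rewrite_registry <<'EOF'
        image: quay.io/jetstack/cert-manager-controller:v0.11.0
EOF

# In-place over the manifests tree (GNU sed; on macOS use `sed -i ''`):
#   grep -rlE '(gcr|quay)\.io' manifests-1.2.0/ \
#     | xargs sed -i -E "s#(gcr|quay)\.io/#${PRIVATE_REG}/#g"
```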
Once edited, push the modified manifests to your own GitHub repository and cut a release tarball:
Then copy the tarball URL into the uri field of kfctl_istio_dex.v1.2.0.yaml. When kfctl apply -V -f kfctl_istio_dex.v1.2.0.yaml runs, it automatically pulls and extracts that tarball; if the uri is left unchanged, the default tarball overwrites the replacements made above and the registry switch silently fails:
# kfdef/kfctl_istio_dex.v1.2.0.yaml
....................
- kustomizeConfig:
repoRef:
name: manifests
path: kfserving/installs/generic
name: kfserving
repos:
- name: manifests
uri: https://github.com/AlexGuoMe/manifests/archive/refs/tags/0.5.tar.gz
Re-apply the kfctl_istio_dex.v1.2.0.yaml file:
The YAML was already applied once with the old image references, so after switching the registry addresses it must be applied again.
$ kfctl apply -V -f $CONFIG_FILE
At this point essentially all Pods can start. If a few still fail, inspect the Pod or container error messages and fix them case by case.
Verifying the installation
Once all Pods are Ready, the hard part is over. Use port-forward to open a temporary local port and try the system:
$ kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
Change the username and password; I changed admin@kubeflow.org to notadmin:
# Export dex-config.yaml, edit it, then apply it back
$ kubectl get configmap dex -n auth -o jsonpath='{.data.config.yaml}' > dex-config.yaml
$ kubectl create configmap dex --from-file=config.yaml=dex-config.yaml -n auth --dry-run=client -o yaml | kubectl apply -f -
$ kubectl rollout restart deployment dex -n auth
Configuring LDAP login
The LDAP configuration file:
# dex-config-final.yaml
issuer: http://dex.auth.svc.cluster.local:5556/dex
storage:
  type: kubernetes
  config:
    inCluster: true
web:
  http: 0.0.0.0:5556
logger:
  level: "debug"
  format: text
oauth2:
  skipApprovalScreen: true
enablePasswordDB: true
staticPasswords:
- email: notadmin
  hash: $2y$12$ruoM7FqXrpVgaol44eRZW.4HWS8SAvg6KYVVSCIwKQPBmTpCm.EeO
  username: admin
  userID: 08a8684b-db88-4b73-90a9-3cd1661f5466
staticClients:
- id: kubeflow-oidc-authservice
  redirectURIs: ["/login/oidc"]
  name: 'Dex Login Application'
  secret: pUBnBOY80SnXgjibTYM9ZWNzY2xreNGQok
connectors:
- type: ldap
  # Required field for connector id.
  id: ldap
  # Required field for connector name.
  name: LDAP
  config:
    # Host and optional port of the LDAP server in the form "host:port".
    # If the port is not supplied, it will be guessed based on "insecureNoSSL",
    # and "startTLS" flags. 389 for insecure or StartTLS connections, 636
    # otherwise.
    host: x.x.x.x:port
    # Following field is required if the LDAP host is not using TLS (port 389).
    # Because this option inherently leaks passwords to anyone on the same network
    # as dex, THIS OPTION MAY BE REMOVED WITHOUT WARNING IN A FUTURE RELEASE.
    #
    insecureNoSSL: true
    # If a custom certificate isn't provided, this option can be used to turn off
    # TLS certificate checks. As noted, it is insecure and shouldn't be used outside
    # of explorative phases.
    #
    insecureSkipVerify: true
    # When connecting to the server, connect using the ldap:// protocol then issue
    # a StartTLS command. If unspecified, connections will use the ldaps:// protocol
    #
    startTLS: false
    # Path to a trusted root certificate file. Default: use the host's root CA.
    # rootCA: /etc/dex/ldap.ca
    # clientCert: /etc/dex/ldap.cert
    # clientKey: /etc/dex/ldap.key
    # A raw certificate file can also be provided inline.
    # rootCAData: ( base64 encoded PEM file )
    # The DN and password for an application service account. The connector uses
    # these credentials to search for users and groups. Not required if the LDAP
    # server provides access for anonymous auth.
    # Please note that if the bind password contains a `$`, it has to be saved in an
    # environment variable which should be given as the value to `bindPW`.
    bindDN: cn=internal-auth,ou=applications,dc=xxx,dc=ai
    bindPW: password
    # The attribute to display in the provided password prompt. If unset, will
    # display "Username"
    usernamePrompt: username
    # User search maps a username and password entered by a user to a LDAP entry.
    userSearch:
      # BaseDN to start the search from. It will translate to the query
      # "(&(objectClass=person)(uid=<username>))".
      baseDN: dc=xxx,dc=ai
      # Optional filter to apply when searching the directory.
      # filter: "(objectClass=inetOrgPerson)"
      # username attribute used for comparing user entries. This will be translated
      # and combined with the other filter as "(<attr>=<username>)".
      username: cn
      # The following three fields are direct mappings of attributes on the user entry.
      # String representation of the user.
      idAttr: cn
      # Required. Attribute to map to Email.
      emailAttr: mail
      # Maps to display name of users. No default value.
      nameAttr: givenName
    # Group search queries for groups given a user entry.
    groupSearch:
      # BaseDN to start the search from. It will translate to the query
      # "(&(objectClass=group)(member=<user uid>))".
      baseDN: dc=xxx,dc=ai
      # Optional filter to apply when searching the directory.
      filter: "(objectClass=groupOfNames)"
      # Following two fields are used to match a user to a group. It adds an additional
      # requirement to the filter that an attribute in the group must match the user's
      # attribute value.
      userAttr: DN
      groupAttr: member
      # Represents group name.
      nameAttr: cn
# Apply the updated configuration to Kubernetes
$ kubectl create configmap dex --from-file=config.yaml=dex-config-final.yaml -n auth --dry-run=client -o yaml | kubectl apply -f -
$ kubectl rollout restart deployment dex -n auth
Kubeflow has many components. On an unrestricted network its deployment should be fairly simple, but in a domestic network environment you hit plenty of pitfalls, and working through them requires some understanding of how Kubernetes operates. Troublesome things are not always bad things. Just enjoy it and study in trouble.