Installing kubeflow on Kubernetes in China's network environment


kubeflow is a project that makes it easy to deploy machine learning workflows on kubernetes, integrating many tools commonly used for machine learning. In principle there is nothing to say about installing a tool — just follow the docs — but kubeflow has many components running on top of kubernetes, and the official installation tooling does not account for network restrictions in mainland China, which makes installation somewhat more complicated. This article therefore aims to provide a stable installation procedure for users in China. It involves more manual steps, but that lets you understand the installation in detail and generalize the approach: the next time you hit a similar network problem (for example, when installing k8s itself), you will know where to start.

By "network problem" I mean: the network in mainland China cannot directly reach Google's official image registry. Many cloud-native and adjacent projects are led by Google — kubernetes, kubeflow, and so on — and their deployments pull images from that registry, which makes installation difficult for users in China.

kubeflow officially recommends two categories of deployment methods.

This article deploys via manifests. The official deployment method has gone through three iterations: initially plain YAML files, then the ksonnet-based kfctl.sh script, and currently the kustomize-based kfctl binary installer.

The kfctl command provides the following subcommands:

  • build: generates the configuration for each kubeflow component as YAML files.
  • apply: runs build first, then deploys kubeflow onto the kubernetes cluster using the configuration build produced.
  • delete: removes a deployed kubeflow installation; the --force-deletion flag forces deletion.
  • alpha: kfctl alpha help lists commands still in alpha, such as set-image-name, which rewrites image addresses when gcr is unreachable (I have not used it, both because it is still alpha and because I have not tested how stable and up to date the domestic mirrors are).

Prerequisites

  • A working kubernetes cluster, either a managed service (Tencent Cloud, Alibaba Cloud, etc.) or self-hosted.
  • Mind the compatibility matrix between kubernetes and kubeflow versions: github.com/kubeflow/we…

Choosing software versions

kubeflow is still a fairly young project, so don't be too aggressive when picking versions; a combination that the AWS distribution has already tested is a safe choice.

  • kubernetes: 1.18
  • kubeflow: 1.2.0

What happens if we ignore the network problem and install kubeflow directly?

I already have a working kubernetes cluster with one node:

$ kubectl get nodes -A
NAME        STATUS   ROLES    AGE   VERSION
10.0.5.43   Ready    <none>   22h   v1.20.6-tke.9

Download the latest kubeflow installer, kfctl, from github:

# github release page: https://github.com/kubeflow/kfctl/releases
$ wget https://github.com/kubeflow/kfctl/releases/download/v1.2.0/kfctl_v1.2.0-0-gbc038f9_darwin.tar.gz
$ tar -xf kfctl_v1.2.0-0-gbc038f9_darwin.tar.gz
$ mv ./kfctl /usr/local/bin

$ kfctl version
kfctl v1.2.0-0-gbc038f9

Download the manifests package for the version you need (taking v1.2.0, the version used at my company, as an example):

$ wget https://github.com/kubeflow/manifests/archive/refs/tags/v1.2.0.tar.gz
$ tar -xf v1.2.0.tar.gz

List the YAML files used to install kubeflow (there are many, for different installation scenarios; we use kfctl_istio_dex.v1.2.0.yaml, and you can edit the file contents as needed):

$ ls manifests-1.2.0/kfdef/
OWNERS                               kfctl_aws.yaml                       kfctl_ibm.v1.0.1.yaml                kfctl_istio_dex.v1.2.0.yaml
README.md                            kfctl_aws_cognito.v1.0.0.yaml        kfctl_ibm.v1.0.2.yaml                kfctl_istio_dex.yaml
generic                              kfctl_aws_cognito.v1.0.1.yaml        kfctl_ibm.v1.1.0.yaml                kfctl_k8s_istio.v1.0.0.yaml
kfctl_anthos.v1.0.0.yaml             kfctl_aws_cognito.v1.0.2.yaml        kfctl_ibm.v1.2.0.yaml                kfctl_k8s_istio.v1.0.1.yaml
kfctl_anthos.v1.0.1.yaml             kfctl_aws_cognito.v1.1.0.yaml        kfctl_ibm.yaml                       kfctl_k8s_istio.v1.0.2.yaml
kfctl_anthos.v1.0.2.yaml             kfctl_aws_cognito.v1.2.0.yaml        kfctl_ibm_dex_multi_user.v1.1.0.yaml kfctl_k8s_istio.v1.1.0.yaml
kfctl_anthos.yaml                    kfctl_aws_cognito.yaml               kfctl_ibm_multi_user.v1.2.0.yaml     kfctl_k8s_istio.v1.2.0.yaml
kfctl_aws.v1.0.0.yaml                kfctl_azure.v1.1.0.yaml              kfctl_ibm_multi_user.yaml            kfctl_k8s_istio.yaml
kfctl_aws.v1.0.1.yaml                kfctl_azure.v1.2.0.yaml              kfctl_istio_dex.v1.0.0.yaml          kfctl_openshift.v1.1.0.yaml
kfctl_aws.v1.0.2.yaml                kfctl_azure_aad.v1.1.0.yaml          kfctl_istio_dex.v1.0.1.yaml          kfctl_openshift.v1.2.0.yaml
kfctl_aws.v1.1.0.yaml                kfctl_azure_aad.v1.2.0.yaml          kfctl_istio_dex.v1.0.2.yaml          source
kfctl_aws.v1.2.0.yaml                kfctl_ibm.v1.0.0.yaml                kfctl_istio_dex.v1.1.0.yaml

Set the environment variables kfctl needs:

# deployment name
$ export KF_NAME="a4x-kubeflow"
# directory containing the manifests package
$ export KF_DIR="/Users/alex/Workspace/study/ml/kubeflow"
# full path to the config file
$ export CONFIG_FILE="/Users/alex/Workspace/study/ml/kubeflow/manifests-1.2.0/kfdef/kfctl_istio_dex.v1.2.0.yaml"

Create a storage class on the kubernetes cluster and set it as the default storage class:

When the kubeflow configuration is applied, it automatically creates several volumes for data storage, but the configuration does not name a StorageClass, so we must create one and mark it as the default.

$ kubectl get sc
NAME                 PROVISIONER           RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
kubeflow             udisk.csi.ucloud.cn   Delete          WaitForFirstConsumer   true                   38m

$ kubectl patch storageclass kubeflow -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
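
To confirm the patch took effect, `kubectl get sc` marks the default class with `(default)` after its name. A minimal sanity check, sketched here against sample output (on a live cluster, pipe the real `kubectl get sc` output instead):

```shell
# Count StorageClasses marked "(default)"; there should be exactly one.
# The sample text below stands in for real `kubectl get sc` output.
sc_output='NAME                 PROVISIONER           RECLAIMPOLICY
kubeflow (default)   udisk.csi.ucloud.cn   Delete'
printf '%s\n' "$sc_output" | grep -c '(default)'
```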

For detailed documentation see: kubernetes.io/zh/docs/tas…

Configure trustworthy JWTs in the apiserver:

kubeflow depends on istio, and istio 1.3.1 and above requires this configuration. Append the following lines to the apiserver config file:

--service-account-signing-key-file=/etc/kubernetes/ssl/ca-key.pem
--service-account-issuer=kubernetes.default.svc
--api-audiences=kubernetes.default.svc
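
A quick way to confirm all three flags made it into the config file; the temp file below is a stand-in, so point `conf` at your real apiserver config (the path varies by distribution):

```shell
# Write a sample config, then check that each required flag is present.
conf=$(mktemp)
cat > "$conf" <<'EOF'
--service-account-signing-key-file=/etc/kubernetes/ssl/ca-key.pem \
--service-account-issuer=kubernetes.default.svc \
--api-audiences=kubernetes.default.svc
EOF
for flag in service-account-signing-key-file service-account-issuer api-audiences; do
    grep -q -- "--$flag=" "$conf" && echo "$flag: ok"
done
```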

So my apiserver config file now reads:

# /etc/kubernetes/apiserver

......
--proxy-client-key-file=/etc/kubernetes/ssl/aggregator-key.pem \
--service-cluster-ip-range=172.17.0.0/16 \
--service-node-port-range=30000-32767 \
--service-account-signing-key-file=/etc/kubernetes/ssl/ca-key.pem \
--service-account-issuer=kubernetes.default.svc \
--api-audiences=kubernetes.default.svc

For details see: imroc.cc/istio/troub…

Apply the configuration to the kubernetes cluster to create the applications:

$ kfctl apply -V -f $CONFIG_FILE

Now check the state of the workloads in the cluster:

$ kubectl get pods -A
NAMESPACE      NAME                                                         READY   STATUS              RESTARTS   AGE
cert-manager   cert-manager-59b485c4cc-r9hdh                                0/1     ImagePullBackOff    0          43m
cert-manager   cert-manager-cainjector-5bb487bcd-h5gds                      0/1     ImagePullBackOff    0          43m
cert-manager   cert-manager-webhook-74b4bd9bcc-7gwpn                        0/1     ContainerCreating   0          43m
istio-system   cluster-local-gateway-84bb595449-swdv5                       0/1     ImagePullBackOff    0          43m
istio-system   istio-citadel-7f66ddfcfb-9ftnz                               0/1     ImagePullBackOff    0          43m
istio-system   istio-galley-7976dd55cd-c7xqd                                0/1     ContainerCreating   0          43m
istio-system   istio-ingressgateway-c79f9f6f-qkvqv                          0/1     ContainerCreating   0          43m
istio-system   istio-nodeagent-6z27f                                        0/1     ImagePullBackOff    0          43m
istio-system   istio-nodeagent-9f2v8                                        0/1     ImagePullBackOff    0          43m
istio-system   istio-nodeagent-g8c4k                                        0/1     ImagePullBackOff    0          43m
istio-system   istio-nodeagent-hnvmm                                        0/1     ImagePullBackOff    0          43m
istio-system   istio-nodeagent-rcrxm                                        0/1     ImagePullBackOff    0          43m
istio-system   istio-nodeagent-ssj4f                                        0/1     ImagePullBackOff    0          43m
istio-system   istio-pilot-7bd96d69d9-jbwb2                                 0/2     ContainerCreating   0          43m
istio-system   istio-policy-66b5d9887c-k4rxl                                0/2     ContainerCreating   0          43m
istio-system   istio-security-post-install-release-1.3-latest-daily-gzqdj   0/1     ImagePullBackOff    0          43m
istio-system   istio-sidecar-injector-56b6997f7d-9ql22                      0/1     ContainerCreating   0          43m
istio-system   istio-telemetry-856f7bcff4-2gk7v                             0/2     ContainerCreating   0          43m
istio-system   prometheus-65fdcbc857-xgxtx                                  0/1     ContainerCreating   0          43m
kube-system    cloudprovider-ucloud-7d959cc87d-r9q4s                        1/1     Running             0          4h34m
kube-system    coredns-68599f8c7f-9stf9                                     1/1     Running             0          4h34m
kube-system    coredns-68599f8c7f-q5m4p                                     1/1     Running             0          4h34m
kube-system    csi-udisk-4ssjk                                              2/2     Running             0          4h33m
kube-system    csi-udisk-5vhtn                                              2/2     Running             0          4h33m
kube-system    csi-udisk-controller-0                                       5/5     Running             0          4h34m
kube-system    csi-udisk-fktzc                                              2/2     Running             0          4h33m
kube-system    csi-udisk-xd7z8                                              2/2     Running             0          4h33m
kube-system    csi-udisk-xn7wb                                              2/2     Running             0          4h33m
kube-system    csi-udisk-zpmx4                                              2/2     Running             0          4h33m
kube-system    csi-ufile-c4fsj                                              2/2     Running             0          4h33m
kube-system    csi-ufile-controller-0                                       4/4     Running             0          4h34m
kube-system    csi-ufile-ctvfn                                              2/2     Running             0          4h33m
kube-system    csi-ufile-nc47n                                              2/2     Running             0          4h33m
kube-system    csi-ufile-td9fs                                              2/2     Running             0          4h33m
kube-system    csi-ufile-wj9kl                                              2/2     Running             0          4h33m
kube-system    csi-ufile-zxlsp                                              2/2     Running             0          4h33m
kube-system    metrics-server-749544fd7b-52hr2                              1/1     Running             0          4h34m
kube-system    nvidia-device-plugin-daemonset-7hzm9                         1/1     Running             0          4h33m
kube-system    nvidia-device-plugin-daemonset-sjdmj                         1/1     Running             0          4h33m
kube-system    uk8s-kubectl-7585dc44f7-5ch9l                                1/1     Running             0          4h34m
kubeflow       application-controller-stateful-set-0                        0/1     ImagePullBackOff    0          43m

Many pods are stuck in ImagePullBackOff, which means the required images cannot be pulled directly from gcr.

Working around the network problem to start all kubeflow services

Given the above — images cannot be pulled from gcr, so the services cannot start — the approach is: find out which images the failing Pods need, pull them on a machine that can reach gcr, push them to a private registry, and finally point the relevant YAML files at the new image addresses.

Determine the list of images that must be pulled manually:

# list all Pods
$ kubectl get pods -A

# show the images a given Pod uses
$ kubectl describe pod <podName> -n <namespace>
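
Rather than describing pods one at a time, the full image list can be pulled in bulk with jsonpath and de-duplicated. A sketch, demonstrating the de-dup step on sample data (the kubectl one-liner in the comment is what you would run against the live cluster):

```shell
# On a live cluster, generate the raw list with:
#   kubectl get pods -A -o jsonpath='{.items[*].spec.containers[*].image}' | tr ' ' '\n'
# Sample output stands in here; sort -u collapses duplicates.
images='gcr.io/kubeflow-images-public/kubernetes-sigs/application:1.0-beta
quay.io/jetstack/cert-manager-controller:v0.11.0
gcr.io/kubeflow-images-public/kubernetes-sigs/application:1.0-beta'
printf '%s\n' "$images" | sort -u
```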

Once you know which images the failing Pods use, pull them locally, retag them, and push them to your private registry:

# a single image as an example
$ docker pull quay.io/jetstack/cert-manager-controller:v0.11.0
$ docker tag quay.io/jetstack/cert-manager-controller:v0.11.0 hub.service.xxcloud.cn/xxx-kubeflow/jetstack/cert-manager-controller:v0.11.0
$ docker push hub.service.xxcloud.cn/xxx-kubeflow/jetstack/cert-manager-controller:v0.11.0
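
With a longer image list, the pull/tag/push cycle is worth scripting. A sketch (the registry address is the same placeholder used above; the docker commands are echoed so the name mapping can be reviewed first — drop the echo to actually run them):

```shell
# Map each public image to the private registry by swapping the host part.
REGISTRY=hub.service.xxcloud.cn/xxx-kubeflow
for img in quay.io/jetstack/cert-manager-controller:v0.11.0 \
           gcr.io/kubeflow-images-public/ingress-setup:latest; do
    new="$REGISTRY/${img#*/}"   # strip everything up to the first "/"
    echo docker pull "$img"
    echo docker tag "$img" "$new"
    echo docker push "$new"
done
```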

Use vscode to search-and-replace across the whole manifests tree (command+shift+h for global replace), changing gcr.io and quay.io to your own private registry address.
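
If you prefer the command line over vscode, the same replacement can be scripted with find and sed. A sketch over a throwaway copy (registry placeholder as above; always run this against a copy of the manifests tree first):

```shell
REGISTRY=hub.service.xxcloud.cn/xxx-kubeflow
demo=$(mktemp -d)
cat > "$demo/deploy.yaml" <<'EOF'
        image: gcr.io/kubeflow-images-public/ingress-setup:latest
        image: quay.io/jetstack/cert-manager-controller:v0.11.0
EOF
# Replace both registry hosts across every YAML file in the tree.
# GNU sed syntax; on macOS use `sed -i ''` instead of `sed -i`.
find "$demo" -name '*.yaml' \
    -exec sed -i "s#gcr\.io#$REGISTRY#g; s#quay\.io#$REGISTRY#g" {} +
cat "$demo/deploy.yaml"
```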

Once the edits are done, push the modified manifests to your own github repository and cut a release tarball:

Then copy the tarball URL into the URI field of kfctl_istio_dex.v1.2.0.yaml. When kfctl apply -V -f kfctl_istio_dex.v1.2.0.yaml runs, it pulls and extracts this tarball automatically; if the URI is left unchanged, the default tarball will overwrite the replacements you just made and the registry switch will not take effect:

# kfdef/kfctl_istio_dex.v1.2.0.yaml
....................

  - kustomizeConfig:
      repoRef:
        name: manifests
        path: kfserving/installs/generic
    name: kfserving
  repos:
  - name: manifests
    uri: https://github.com/AlexGuoMe/manifests/archive/refs/tags/0.5.tar.gz
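
The uri edit itself can also be scripted, which helps when regenerating the kfdef for new releases. A sketch against a minimal stand-in file (the release URL is the author's example fork; substitute your own tarball address):

```shell
KFDEF=$(mktemp)
cat > "$KFDEF" <<'EOF'
  repos:
  - name: manifests
    uri: https://github.com/kubeflow/manifests/archive/v1.2.0.tar.gz
EOF
# Point the manifests repo at the forked release tarball.
sed -i 's#uri: .*#uri: https://github.com/AlexGuoMe/manifests/archive/refs/tags/0.5.tar.gz#' "$KFDEF"
grep 'uri:' "$KFDEF"
```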

Re-apply the kfctl_istio_dex.v1.2.0.yaml file:

We already applied the YAML once, but its image references were the old ones; after switching the registry, we need to apply again.

$ kfctl apply -V -f $CONFIG_FILE

At this point essentially all the images can start. If individual Pods still fail, inspect the Pod or container error messages and fix them case by case.

Verifying the installation

Once all the Pods are Ready, we are mostly there. Use port-forward to temporarily expose a local port, then open and test the system.

$ kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80

Change the username and password — I changed admin@kubeflow.org to notadmin:

# export dex-config.yaml, edit it, then re-apply
$ kubectl get configmap dex -n auth -o jsonpath='{.data.config.yaml}' > dex-config.yaml
$ kubectl create configmap dex --from-file=config.yaml=dex-config.yaml -n auth --dry-run -oyaml | kubectl apply -f -
$ kubectl rollout restart deployment dex -n auth

Configuring LDAP login

The LDAP configuration file is as follows:

# dex-config-final.yaml

issuer: http://dex.auth.svc.cluster.local:5556/dex
storage:
  type: kubernetes
  config:
    inCluster: true
web:
  http: 0.0.0.0:5556
logger:
  level: "debug"
  format: text
oauth2:
  skipApprovalScreen: true
enablePasswordDB: true
staticPasswords:
- email: notadmin
  hash: $2y$12$ruoM7FqXrpVgaol44eRZW.4HWS8SAvg6KYVVSCIwKQPBmTpCm.EeO
  username: admin
  userID: 08a8684b-db88-4b73-90a9-3cd1661f5466
staticClients:
- id: kubeflow-oidc-authservice
  redirectURIs: ["/login/oidc"]
  name: 'Dex Login Application'
  secret: pUBnBOY80SnXgjibTYM9ZWNzY2xreNGQok
connectors:
- type: ldap
  # Required field for connector id.
  id: ldap
  # Required field for connector name.
  name: LDAP
  config:
    # Host and optional port of the LDAP server in the form "host:port".
    # If the port is not supplied, it will be guessed based on "insecureNoSSL",
    # and "startTLS" flags. 389 for insecure or StartTLS connections, 636
    # otherwise.
    host: x.x.x.x:port
  
    # Following field is required if the LDAP host is not using TLS (port 389).
    # Because this option inherently leaks passwords to anyone on the same network
    # as dex, THIS OPTION MAY BE REMOVED WITHOUT WARNING IN A FUTURE RELEASE.
    #
    insecureNoSSL: true
  
    # If a custom certificate isn't provide, this option can be used to turn off
    # TLS certificate checks. As noted, it is insecure and shouldn't be used outside
    # of explorative phases.
    #
    insecureSkipVerify: true
  
    # When connecting to the server, connect using the ldap:// protocol then issue
    # a StartTLS command. If unspecified, connections will use the ldaps:// protocol
    #
    startTLS: false
  
    # Path to a trusted root certificate file. Default: use the host's root CA.
    # rootCA: /etc/dex/ldap.ca
    # clientCert: /etc/dex/ldap.cert
    # clientKey: /etc/dex/ldap.key
  
    # A raw certificate file can also be provided inline.
    # rootCAData: ( base64 encoded PEM file )
  
    # The DN and password for an application service account. The connector uses
    # these credentials to search for users and groups. Not required if the LDAP
    # server provides access for anonymous auth.
    # Please note that if the bind password contains a `$`, it has to be saved in an
    # environment variable which should be given as the value to `bindPW`.
    bindDN: cn=internal-auth,ou=applications,dc=xxx,dc=ai
    bindPW: password
  
    # The attribute to display in the provided password prompt. If unset, will
    # display "Username"
    usernamePrompt: username
  
    # User search maps a username and password entered by a user to a LDAP entry.
    userSearch:
      # BaseDN to start the search from. It will translate to the query
      # "(&(objectClass=person)(uid=<username>))".
      baseDN: dc=xxx,dc=ai
      # Optional filter to apply when searching the directory.
      # filter: "(objectClass=inetOrgPerson)"
  
      # username attribute used for comparing user entries. This will be translated
      # and combined with the other filter as "(<attr>=<username>)".
      username: cn
      # The following three fields are direct mappings of attributes on the user entry.
      # String representation of the user.
      idAttr: cn
      # Required. Attribute to map to Email.
      emailAttr: mail
      # Maps to display name of users. No default value.
      nameAttr: givenName
  
    # Group search queries for groups given a user entry.
    groupSearch:
      # BaseDN to start the search from. It will translate to the query
      # "(&(objectClass=group)(member=<user uid>))".
      baseDN: dc=xxx,dc=ai
      # Optional filter to apply when searching the directory.
      filter: "(objectClass=groupOfNames)"
  
      # Following two fields are used to match a user to a group. It adds an additional
      # requirement to the filter that an attribute in the group must match the user's
      # attribute value.
      userAttr: DN
      groupAttr: member
  
      # Represents group name.
      nameAttr: cn

# apply the updated config to kubernetes
$ kubectl create configmap dex --from-file=config.yaml=dex-config-final.yaml -n auth --dry-run -oyaml | kubectl apply -f -
$ kubectl rollout restart deployment dex -n auth

kubeflow has many components. Deploying it with unrestricted network access should be fairly straightforward, but inside China's network you run into many pitfalls, which in turn demands some understanding of how kubernetes works. Troublesome things are not necessarily bad things — just enjoy it and learn from the trouble.
