An Introduction to MLOps

1. MLOps

1.1 Definition

MLOps (Machine Learning Operations) is a paradigm, including aspects like best practices, sets of concepts, as well as a development culture when it comes to the end-to-end conceptualization, implementation, monitoring, deployment, and scalability of machine learning products. Most of all, it is an engineering practice that leverages three contributing disciplines: machine learning, software engineering (especially DevOps), and data engineering. MLOps is aimed at productionizing machine learning systems by bridging the gap between development (Dev) and operations (Ops). Essentially, MLOps aims to facilitate the creation of machine learning products by leveraging these principles: CI/CD automation, workflow orchestration, reproducibility; versioning of data, model, and code; collaboration; continuous ML training and evaluation; ML metadata tracking and logging; continuous monitoring; and feedback loops.

MLOps = Machine Learning + DevOps

1.2 Goals

Connect every stage of machine learning from experimentation to production, shorten the cycle from experiment to production, automate the manual steps in the ML production process, and ultimately build a fully automated, end-to-end ML production pipeline.
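
As a toy illustration of what "end to end" means here, the sketch below (stdlib only; all stage names and the trivial "model" are hypothetical stand-ins, not any real framework's API) chains data preparation, training, evaluation, and a deployment quality gate into one automated flow:

```python
def prepare_data(raw):
    # Stand-in for cleaning / feature engineering.
    return [x for x in raw if x is not None]

def train(dataset):
    # Stand-in for model training: the "model" is just the mean.
    return sum(dataset) / len(dataset)

def evaluate(model, dataset):
    # Stand-in for validation: mean absolute error of the "model".
    return sum(abs(x - model) for x in dataset) / len(dataset)

def deploy(model, error, threshold=2.0):
    # Quality gate: only promote models that pass evaluation.
    return {"model": model, "status": "deployed" if error <= threshold else "rejected"}

raw = [1.0, None, 2.0, 3.0]
dataset = prepare_data(raw)
model = train(dataset)                           # 2.0
release = deploy(model, evaluate(model, dataset))
print(release["status"])                         # deployed
```

In a real MLOps setup each function would be a separate, independently scheduled pipeline step, but the shape of the flow is the same: every manual hand-off between stages is replaced by an automated one.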

1.3 Principles

  • CI/CD automation: It carries out the build, test, delivery, and deploy steps, and provides fast feedback to developers.
  • Workflow orchestration: It coordinates the tasks of the ML workflow pipeline according to DAGs (directed acyclic graphs).
  • Reproducibility: It is the ability to reproduce an ML experiment and obtain exactly the same results.
  • Versioning: It ensures the versioning of data, model, and code to enable not only reproducibility but also traceability.
  • Collaboration: It ensures the possibility to work collaboratively on data, model, and code (across different roles).
  • Continuous ML training & evaluation: It means periodic retraining of the ML model based on new feature data.
  • ML metadata tracking/logging: Metadata is tracked and logged for each orchestrated ML workflow task.
  • Continuous monitoring: It implies the periodic assessment of data, model, code, infrastructure resources, and model-serving performance (e.g., prediction accuracy) to detect potential errors or changes that influence product quality.
  • Feedback loops: Multiple feedback loops are required to integrate insights from the quality-assessment step back into the development or engineering process.
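
To make the versioning and reproducibility principles concrete, the stdlib-only sketch below (the function name and inputs are hypothetical, not from any real tool) derives a deterministic run ID from the three things that must be versioned together: data, code, and configuration. Identical inputs always yield the identical ID, which is exactly what makes a run reproducible and traceable.

```python
import hashlib
import json

def run_fingerprint(data: bytes, code: str, config: dict) -> str:
    """Derive a deterministic run ID from data, code, and config."""
    h = hashlib.sha256()
    h.update(data)                                          # training-data snapshot
    h.update(code.encode())                                 # training-script source
    h.update(json.dumps(config, sort_keys=True).encode())   # hyperparameters
    return h.hexdigest()[:12]

config = {"lr": 0.01, "epochs": 5, "seed": 42}
rid_a = run_fingerprint(b"rows-v1", "def train(): ...", config)
rid_b = run_fingerprint(b"rows-v1", "def train(): ...", config)
rid_c = run_fingerprint(b"rows-v2", "def train(): ...", config)  # new data version
print(rid_a == rid_b, rid_a == rid_c)  # True False
```

Real systems (DVC, MLflow, Git) track the three inputs separately, but the underlying idea is the same: a run is only reproducible if every input that influenced it is pinned to a version.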

1.4 Components

  • CI/CD Component (P1, P6, P9): It ensures continuous integration, continuous delivery, and continuous deployment. E.g., Jenkins, GitLab CI, SCM, and so on.
  • Source Code Repository (P4, P5): It ensures code storage and versioning. E.g., Bitbucket, GitLab, GitHub, and so on.
  • Workflow Orchestration Component (P2, P3, P6): It coordinates the tasks of the ML workflow pipeline according to DAGs. E.g., Apache Airflow, Kubeflow Pipelines, Luigi, AWS SageMaker Pipelines, Byteflow, and so on.
  • Feature Store System (P3, P4): It ensures central storage of commonly used features. E.g., Feast, AWS Feature Store, Tecton.ai, Hopsworks.ai, and so on.
  • Model Training Infrastructure (P6): It provides the underlying compute resources. E.g., local machines, cloud compute, Arnold, and so on.
  • Model Registry (P3, P4): It centrally stores trained ML models together with their metadata. E.g., MLflow, AWS SageMaker Model Registry, Microsoft Azure ML Model Registry, Neptune.ai, and so on.
  • ML Metadata Store (P4, P7): It allows for the tracking of various kinds of metadata.
  • Model Serving Component (P1): It helps to serve models. E.g., KFServing (Kubeflow), TensorFlow Serving, Triton, Ivory, and so on.
  • Monitoring Component (P8, P9): It takes care of continuous monitoring of model-serving performance. Additionally, monitoring of the ML infrastructure, CI/CD, and orchestration is required. E.g., Prometheus with Grafana, the ELK stack, TensorBoard, CloudWatch, Argos, and so on.
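
To show what the Model Registry and ML Metadata Store roles look like in practice, here is a minimal in-memory sketch (a hypothetical API for illustration, not MLflow's or SageMaker's): every registered model version carries the metadata needed for traceability, and consumers resolve the latest version by name.

```python
import time

class ModelRegistry:
    """Toy model registry: central store of models plus their metadata."""

    def __init__(self):
        self._models = {}  # name -> list of version records

    def register(self, name, artifact_uri, metrics, data_version):
        versions = self._models.setdefault(name, [])
        record = {
            "version": len(versions) + 1,
            "artifact_uri": artifact_uri,   # where the trained model lives
            "metrics": metrics,             # evaluation results
            "data_version": data_version,   # lineage back to the training data
            "registered_at": time.time(),
        }
        versions.append(record)
        return record["version"]

    def latest(self, name):
        return self._models[name][-1]

registry = ModelRegistry()
registry.register("churn", "s3://models/churn/1", {"auc": 0.81}, "data-v1")
registry.register("churn", "s3://models/churn/2", {"auc": 0.84}, "data-v2")
print(registry.latest("churn")["version"])  # 2
```

The key design point is that the registry stores not just the artifact but its lineage (metrics plus data version), which is what lets a serving component or an auditor trace any deployed model back to the run that produced it.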

1.5 Roles

1.6 Workflow

2. Industry Practice

| Category | Solution | CI/CD Component | Source Code Repository | Workflow Orchestration | Feature Store | Model Training Infrastructure | Model Registry | ML Metadata Store | Model Serving | Monitoring |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Open source | Kubeflow | / | / | k8s / Kubeflow Pipelines | Feast | PyTorch / TensorFlow / ... | / | Yes | KFServing / PyTorch Serving / Seldon Core / Triton / BentoML | Prometheus (no performance monitoring) |
| Open source | MLflow | / | / | MLflow Projects | / | / | MLflow Model Registry | MLflow Tracking | MLflow Models | / |
| Commercial | WanDB | / | / | / | / | Hyperparameter Tuning | Artifacts | Experiment Tracking | / | Yes |
| Commercial | SageMaker | Yes | Yes | SageMaker Pipelines | Yes | Yes | Yes | Yes | Yes | Amazon CloudWatch |
| ByteDance | Reckon::MLOps | / | / | / | FeatureStore | / | Yes | Yes | PilotLite / Groot / Laplace / Marine / ServiceHub | Javis |
| ByteDance | DLSpace | GitLab CI | GitLab | byteflow | / | Arnold | In development | Supported | Ivory | ByteDance ecosystem |

2.1 Kubeflow

Overview

The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable, and scalable.

Kubeflow is Google's open-source, Kubernetes-based MLOps solution. It integrates a wide range of tools and mainly provides the following features:

  • Notebooks: a web-based interactive development environment built on Jupyter notebooks.
  • Kubeflow Pipelines: an ML-oriented workflow system built on Argo, providing creation, orchestration/scheduling, and management of ML workflows.
  • Model Training: integrates common ML frameworks via the Kubernetes operator mechanism, providing scheduling and management of model-training jobs.
  • Model Serving: online model deployment built on TF Serving, PyTorch Serving, and Seldon.
  • AutoML: supports parallel search, distributed training, and more.
  • Feature Store: ML feature management based on Feast.
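
Under the hood, Kubeflow Pipelines executes a DAG of steps, where each step runs only after its upstream dependencies finish. The stdlib-only sketch below (hypothetical step names; not the real Kubeflow/Argo API) shows that core idea: topologically order the graph, then run each step on its parents' outputs.

```python
from graphlib import TopologicalSorter

# A toy ML pipeline DAG: step -> set of upstream dependencies.
dag = {
    "ingest":   set(),
    "features": {"ingest"},
    "train":    {"features"},
    "evaluate": {"train", "features"},
}

def run_step(name, upstream_results):
    # Stand-in for launching a containerized pipeline step.
    return f"{name}({', '.join(sorted(upstream_results))})"

results = {}
order = list(TopologicalSorter(dag).static_order())
for step in order:
    results[step] = run_step(step, [results[d] for d in dag[step]])

print(" -> ".join(order))  # ingest -> features -> train -> evaluate
```

In the real system each `run_step` is a container scheduled on Kubernetes and independent branches run in parallel, but the dependency-driven execution order is the same.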

Architecture

2.2 WanDB

Overview

WanDB is an excellent tool for tracking and analyzing machine-learning model training. Its main features include:

  • Experiment management.
  • Hyperparameter tuning.
  • Experiment visualization.
  • Versioning of models, data, and code.
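
Whatever the tool, the core of experiment tracking is recording each run's hyperparameters and metrics so runs can be compared later. Here is a stdlib-only sketch of that idea (a hypothetical API for illustration, not WanDB's actual interface):

```python
class ExperimentTracker:
    """Toy experiment tracker: records params and metrics per run."""

    def __init__(self):
        self.runs = []

    def start_run(self, **params):
        run = {"params": params, "metrics": {}}
        self.runs.append(run)
        return run

    def log(self, run, **metrics):
        run["metrics"].update(metrics)

    def best_run(self, metric):
        return max(self.runs, key=lambda r: r["metrics"][metric])

tracker = ExperimentTracker()
for lr in (0.1, 0.01):
    run = tracker.start_run(lr=lr)
    # Stand-in for a real training loop; pretend the smaller lr scores higher.
    tracker.log(run, accuracy=0.9 if lr == 0.01 else 0.8)

print(tracker.best_run("accuracy")["params"]["lr"])  # 0.01
```

A hosted tracker adds persistence, dashboards, and artifact versioning on top, but the comparison query at the end, "which hyperparameters produced the best metric?", is the feature everything else is built around.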

Demo

wandb.ai/wandb/wandb…

2.3 SageMaker

Overview

SageMaker is Amazon's fully managed machine-learning solution. Its main features are listed at:

docs.amazonaws.cn/sagemaker/l…

Demo

aws.amazon.com/cn/getting-…

2.4 Other Commercial Solutions

2.5 Summary

Industry practice around MLOps is broadly similar: although each solution has its own emphasis, all of them build around the following themes:

  • Automation of the ML workflow.
  • Collaboration among the many roles involved in MLOps.
  • An immersive, one-stop, and maximally simple and efficient user experience.
  • Reproducibility, traceability, explainability, and monitorability of the ML workflow.
  • Reuse and asset-ization of ML workflows, tools, data, models, and services.

3. Conclusion

MLOps is a methodology for ML production. By defining and providing a set of concepts, methods, and tools, it offers best practices for building and maintaining ML systems and guides how such systems are constructed in practice. Its immediate goal is full automation of the ML production workflow; its ultimate goals are higher production quality, shorter delivery cycles, and lower production costs.